








With Keras and Pylorch 


Beginning Anomaly 
Detection Using 
Python-Based Deep 
Learning 





Sridhar Alla 
Suman Kalyan Adar 


Apress’ 


Beginning Anomaly Detection Using Python-Based Deep Learning: 
With Keras and PyTorch 


Sridhar Alla Suman Kalyan Adari 
New Jersey, NJ, USA Tampa, FL, USA 
ISBN-13 (pbk): 978-1-4842-5176-8 ISBN-13 (electronic): 978-1-4842-5177-5 


https://doi.org/10.1007/978-1-4842-5177-5 


Copyright © 2019 by Sridhar Alla, Suman Kalyan Adari 


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the 
material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, 
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information 
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now 
known or hereafter developed. 


Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with 
every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an 
editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the 
trademark. 


The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not 
identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to 
proprietary rights. 


While the advice and information in this book are believed to be true and accurate at the date of publication, 
neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or 
omissions that may be made. The publisher makes no warranty, express or implied, with respect to the 
material contained herein. 


Managing Director, Apress Media LLC: Welmoed Spahr 
Acquisitions Editor: Celestin Suresh John 
Development Editor: Laura Berendson 

Coordinating Editor: Aditee Mirashi 


Cover designed by eStudioCalamar 
Cover image designed by Freepik (www.freepik.com) 


Distributed to the book trade worldwide by Springer Science+Business Media New York, 233 Spring Street, 
6th Floor, New York, NY 10013. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail orders-ny@springer- 
sbm.com, or visit www.springeronline.com. Apress Media, LLC is a California LLC and the sole member 
(owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a 
Delaware corporation. 


For information on translations, please e-mail rights@apress.com or visit www.apress.com/ 
rights-permissions. 


Apress titles may be purchased in bulk for academic, corporate, or promotional use. eBook versions and 
licenses are also available for most titles. For more information, reference our Print and eBook Bulk Sales 
web page at www.apress.com/bulk-sales. 


Any source code or other supplementary material referenced by the author in this book is available to 
readers on GitHub via the book’s product page, located at www.apress.com/978-1-4842-5176-8. For more 
detailed information, please visit www.apress.com/source-code. 


Printed on acid-free paper 


Table of Contents 


POUL TEN LOE Sssstiscnateciesciienicsts mnie sis ese aren eine IX 
About the Technical Reviewer's ........::ssssssssssssssssssescccnssessensccnnssnsssescnncensssesennnensssnseas xi 
PC WING TN Na sink inser cecteaticics cc ceudsiscrentciecernlcscatauiewennesdcccaiwsd tacesbictsonesicnecmtenes xii 
(0 XV 
Chapter 1: What Is Anomaly Detection? .........cccssssscccssssseecncssseeecnnseeeenecnseeeeeenseeeenenns 1 
OT cca risecritietdatice arene niniceendunanienienea 1 
Pe le rete tniotecseter ns cen nse eee renee 1 
PIMIAGS Bo ald POU ecectciceitteivientrrenreenineeeeeiel 5 
PTTANCS BGA TUT OS weicesiceceresdiveseunasnnceinesdveanwssinesviesdusaunnsiiveneusaditnsanssionssueniiaienaluenmedieins 9 

UN Sel accent ucteeaiesecapakuebetectuan Anat ieieiaetvent east viageuansatieencpetignarenesavieniiae aatiaarienueuA einen 11 

LO ONTOS OT AOI ANG secsasvessetrsedsaesnzsesasstereasiissesers aur niaaasiedaa ieee 15 
Date PON ASeU ANONIANES civicinietieiicunniicetniininieiipatin 16 
CONTA BAG OU AIO IIAG a ciccceniessvsnneadieseintaiinesmnndnesaiunsiecnncaduessinessievawesinednusadsecaenaluesmmasincamenivess 16 
Fe Fe rrtireetie attire eros serene eee 17 
PTI EFT ssscresvtclecencasehne earsssitewecesct vesiceinienestaictentrvncavaeeduseidwnyasienetaddnaetalenentenawaresaewenaaieenss 17 
Oe Usa eee cae  eecsc i ese e eee 18 
(Cg 111005 | sees re a ee or se re ee re one nee nee ee eres 18 
OS LC | ee a ee a ee aa ee eee ren eee re ere 18 

The Three Styles of Anomaly Detection .............ssccccsssesssscecsseecsseeeseseeeeeseneesseneeesenesesseeeessnsnseees 19 
Where Is Anomaly Detection USC? ............cccssssssssscsesescresssoressscesuscusseceuerecenesecaeesscueuersueneresens 20 
DUE GL) (cr: Us | En ne nee Se ee ner ere eee nee eee eee eee ere 20 
COU @ 1: | a a ee eee. ee eee ee oan eee 21 
Ele ae fee ae ee ene Ae ee ae ee ee ae ee ee ae TS 21 


il 


TABLE OF CONTENTS 


ALCL 4 1,4 [BPR ne Bese i Bn i Penne ne er Ee Ce PE ree ese oe ene ee 22 
Fe crceececc eee ecseecneue oneness oaextasetoa reso onedeasea ness aacewenetenseadaraeestesaneneeemenemein 22 
BET aa NE et ANNE ssc te cdtccecensecutde dict eva ages dace dapsone ielieedi es ea eae aeaeean 23 
| en ee re rE mee ee ee Ce een ee Rene ese ener ere renee 23 
Chapter 2: Traditional Methods of Anomaly Detection .........:sscccsssssseccssssseeensenseeeens 25 
ebee SCIOIICG TE VIOWy riecasisssreigairdeasiaisaaciitariecsinweeiinraennitasatinadenaineesaeetaaatinentinaes 25 
ETAT Orie al accel nus pec eaces cena oars caaippstigasoaeginied pasa sieaemtueca ai ettalencenma cei denmL a ecRnanaMieNE 34 
TENET WI lisansaseieenseanisaeiaere-eahiseciserteahasabinaceaanietiiaatiastianniet anes ieaneanann aime Maaaemataaanial 34 
Anomaly Detection with Isolation FOrest.............ccccccsscesseesseesseeesseesseeessseeeseeeessenssensseneas 36 
One-Class Support Vector Machine...........ccccccsessesssseessessseeeessessceeeessesseeeeseessaeeeesesssaeeeevesssgeneeeas 51 
PROMALY DETECTION WU SW saisasiceientesvesenesduesainscscecnweaiuecensassessnerdssaumeaavelelenaduesvotadveumnedons 63 
S11) |g Sane rrr an ea ve Teer RTO rr AN De rere Ae ere ne een eee rere ene 71 
Chapter 3: Introduction to Deep Learning... ....cccsssssscccsssssecccccsseeeecnseeeencsseeeneenseesens 73 
What Is Deep Learning? ........ccsccsesesstessscresscccesssccusrssenesscenesscenesacaeesssueusseceuerecnesacazesacaeusnsueuerenens 73 
PAT TATIGUGL INGUIN CTI ON IS csiiscu scenastscciensiiossiresasins vein stnutaes ties eaentnnsilesstiianonapleuseesuiaunanansenseaschinsceus: 74 

Intro to Keras: A Simple Classifier MOE] ...........scccccsssssesssssseessssceessssceeesesseneesesceesssseneseesseneesesaees 84 
Intro to PyTorch: A Simple Classifier MOdel............:::ccsssccsssccssssessssneesssressssesesseeesseesessenenssnens 111 
OE fe re en ee ere eee ene Sern eer en ee eT 122 
Chapter 4: AUTOCNCOESS ......cccsssssecccsnssecccccsseeceecnseeecencseeeeeensseeeeunnseeeenecseeeenensseeeess 123 
What Are AUTOCNCOCEIS? ..........ccccsccssseesseessecsseesseeeseesseeesaeesseeesaeesseeesaeessaeesesesaeesseeesaeesseeeeaees 123 
Be UG a eee acc ec dse ee 125 
Bate Oe ec innecieonnnenaenanewanwin aia 140 
BEI ve iacsaes serecsecnea neseenssnnacentatatuespeneectees nies isan weeetapeaselda umn Pass nnenianegeee ee 142 
Re cercnirencieniveandennanisartnseneeeaeenenerncen renee 144 
DENOISING AUTOCNCOCESS .......scccsscsssessesecsessnsenssscnseeenscnensensanensaeansenensaneneansanensanansenensanensensazensanas 153 
PeUE LUT TG UFZ ICL LE (1810 | ol pc ie peneete le inl re beeing rr tr Pei ying renee Heeler nn reent wrsersrrat erwin rretorrts eerie restr rte 163 
SULT LILES eee noe Bee PSE a ne pre ERT Sree Herts rere ve ny ny een reat re nen eRe ap enn erry ene rrr ern Tener 178 


1V 


TABLE OF CONTENTS 


Chapter 5: BoltZMann Machines... .....ssssccccsssssecccscsseeccccsseeceecnseeececnseeececsseeeennsessens 179 
What Is a Boltzmann Machine? ............:ccccssesseessseesseeeseeesseesseeesseesseeesseesseeessesesaeeesesesensseeeeaees 179 
Restricted Boltzmann Machine (RBM) ........scsscsscscsscscsssssssenssscnsessnscneesensenensavcnsensnsensesensanenranas 181 

Anomaly Detection with the RBM - Credit Card Data Set............cccsessssssssstesesseeesssseenee 187 
Anomaly Detection with the RBM - KDDCUP Data Set...............ccscssssssssssssseeessseeeessenenssees 197 
Be eit tree eect ar ee ee enna 212 

Chapter 6: Long Short-Term Memory Models.........cccssssscccsssseecccsnsseecescsseeenecsseeeens 213 
SEQUENCES ANC TIME SerieS ANALYSIS..........::cccesscccsscecsssseessseeeesseesesseeeesseneeeseneesseneeesseesessenensees 213 
LD or ILE |) |. ati ener Seb cen we Brn Str Dnt re ORE: Oper O URE ok EP en Pretetr Our eer emrenre er Uelwen tenet eeaer 216 
Fe Ue meteeceseeciscen scission 218 
Nee et ne ts ceeteaee aoae ann eeeaeeeueen 223 
Examples Of Time Series........ccccscsscssssssessscccsesscsceessscceeeessccnsesecceeeesecaneesescensesesaeesesecanuenessonsenesas 243 

BT oases eee serene gcins ececuesaiccecpteticsnceesalutuns au apecnstasioos atiaeeoeeusneensdouevecseranaeases 243 
ET TANS TAOIIIY cas scascosisrassosisvanceisaisnaiaioeds tie tered ae aes eee 244 
ieee se eee 246 
BIT TIA DSS. SOUS | AIS siiisscvsiciinectnciwcreinetdeninienaeehddeieeeinieavanetiden 248 
UE WO COT SRS ica ccoicccscoetecsnxcsncicceprneceaceaveanicaoteciaciakesieseevadenneesdanreucsineseceneeeeses 250 
ambient_temperature_SySteM_failUIe...........ccsccccssseseessseeessseeeeeesseneeesseneeeesseneeeessneeessaees 251 
Be eT  eecceavecesneeanssensaszses a tsseteanieeeencmaeesssnasaeeaesannnaeanen 253 
Fe WR cheese earns getaec aces nies pecan asic oleas eid aeledetdaciensadibameniialapeees 254 
Be sesh etcsapie cnc praecetobin cc ciececi vee ela pnasteou tienda nedinbnbendabdnestabouanedecinaaniaiuapeesbaneesieeeneeteenioeniel 256 

Chapter 7: Temporal Convolutional Networks .......ccccsssscsccsssssecccsnsseeeescsseeeneenseeeens 257 
What Is a Temporal Convolutional N@tWOrk?.........ccccccsssccsssssssssssceesssseeeesesceeesesseneesssceeenessneens 257 
Dilated Temporal Convolutional N@tWOFK ..........ssccccssscsssssscesssseeesssseeeesssseesesseneesessnenseseenenenas 262 

Anomaly Detection with the Dilated TON.............:ccssscccsssccsssresssseesssreesseneesssneessseeeessenenssaes 267 
Encoder-Decoder Temporal Convolutional N@twork..........cscccssssscesssscsessssseesssseeesesseeeesssceeneeas 283 
Anomaly Detection with the ED-TON .............:ccssssssssessseessseeesssensesseneesseneesssneeesseesessensnssees 286 
et iva eas seswwictssvs es cee ee esi ee eee 295 


TABLE OF CONTENTS 


Chapter 8: Practical Use Cases of Anomaly Detection... .......:sccccssssseccsssssseeesenseeeens 297 
PUTT ULE (515715 0] | ero ne ae ee ae nea ae ee a eS mee ee ee ee ny mae ee ae ee eee 298 
Real-World Use Cases of Anomaly Detection.............cccccssscssssssssressssreesseeessseeesseeessseneessenens 299 

Oe AIT ch sien ecpr tei bs ects aienineyieeindaneeaneentracn aia lad ning iene arena miecndans 300 
Bol) 4s ee ae ek ee ee ee eee 302 
Be arcetcteciarna cantvs oy ic ned vss oeay ae sduicaioieistvaeraeniaeatesvuevedvesaiceneioseeiniwierinineiraniaaatie 303 
(CLI 6.) ieee eo yee te Pe RCS ne ce POPE ePE ry Peon Ee mene re ee tetra 304 
TP AITSOOT TATION, csccdudcszspacaptectotssavctevapsedvinieiensseadeio eis aii atnraaies see 306 
0 (| ee ee ee eee Sere ee eee en reeeee 307 
Be Aree te cic cccessieeccaeveciodasrvaciecieseceiees snes eae eens eieanddacenneernceed 308 
Be IRs escape ices doesn ticecad snanneeseunieedsnciodecseotaseid ti decsaiscpsel necunserseenieweoenreteces 309 
0 SE casita ens ea ena 312 
JEU le [| ec ese ee ee eee a eee eae eee 313 
I EE pace cn tected ceeaeeaecnrsieseevereeverss enndier eed aviaiennineeneteeniere ements 315 
CLE) [en ee ae ae Pe Pe Pee Pm TOME Peer oe er ene ae ee rae mE eae ree eer ere eee eee tear ee tere 315 
Implementation of Deep Learning-Based Anomaly Detection .............:ccsssccsssreeesssereeessseeeeess 316 
Be chara pena ahaha saccades shade esas da ets ge sn taken shea mabibnse hci bares sea daiaaeeneannees 317 

Appendix A: Intro tO K@raS......csssssccsssssccsesseccssseccncnseecnssecncnseecnsseennseeenenssennnseseneas 319 
(CCU 8 Cf a cee nr ek ens ees ees ees ees ene eee cee ee eee 319 
Ue Ee a0 acccoectdcec eecsobare pees vce ns its cen wre essed tension sccm dnc to sucess teoeseuare densest 320 

a acces teeter sec acevesc teres ascc rasiededeseiees us esteeeceessesees eadeeesesaesatenemasaeess 321 
Model Compilation and Training ...........:csccccsssccsssrecsssreessseeeesseeeesseesssseneesseneesssneeessesessensneas 322 
MIGGEL EVAINALION ANG PTSUICUON sccsxiscscstenricntiartnneineananeeanacaaemens 326 
[2 ee ea eee vee my eee me: Ae eee er a See Oe ee a eo aT eee eee 328 
ey coca | (ie) 1, | ee enn ne rn Be enone ee nee Ber. Ue PPL ner ey ere ey eee ere ree 340 
(LSU 6 SAB n ie ee tr st ne Pree nner ste Beles O yr er St peak Ne Pre arn ae te ue reer arr r corn enres etree aertre 343 
Re actos becca eatesveesaieavaeactesataaesns bettas eee tienen nee 345 
Pe REN ceca chee sreaciere sive eserves cece ne igieciccect a ceva ca cscs cnn aaaneenednieend tes 348 
Fe cess ocstsced seteecy cass sends caseesecyecatssasdeaniovsnssctesictasigaescaseeesd exe viceseabseatsdtlacseseuueateueiadsssteeeens 351 
Back End (TensorFlow Operations).........cccssccsessssesescsesssccesssoesrsseuereseuereceresesaesssaeesnecnerens 358 
erga etreceeseecs eae eces cnn entrees ee eo ne ecneoe senna ones ree ease oes 360 


v1 


TABLE OF CONTENTS 


RODRIG 10 FV OIC Nitin 361 
PAN TS vce cies eietacidesdcn trie cde oie ereavedacose savers eeiceerdeeieraneeiereseenetaenaente 361 
UP te 0 aaciaas sania sencauea vaasaciaanarerianaesania essa diasewiereauausarneisar caine ennai 362 

SEOUEI Ia! Ve. WICUUIELISN iccntictnarierndvicnnneericieeadniamrieneateeiienns 376 
2) en ek enc een ene eee ee eS ee eee eee Rem eee eee 377 

Ecc 11) 1610/1, | 2 hi nee nee aon one eee ee ne ee esr nner ee rE Ae eee rene: Pee eee ea 387 

8) UC) IR ea PO EOD enor ee eo S BURNT Tepe ee ey RIC Seve NaY Uy mT Foye ni ey meer me 390 
Temporal Convolutional Network in PyTOIrch............:ccssccsssscssscecsssresssseeessseeesssneessseeeesseesssnenens 392 
Dilated Temporal Convolutional N@tWOFKk.........c:ccssssesssseseesssscsessssceesesseeesssseeesssneeenessneess 393 
SERIA OR caspases apoaslantesaiteamesasspcantpasonibadoncased secatabanesauaniiaassie aenuasncbnlaasadnncanpsdenabbeuueeal aaaienasiedas 408 
0) (le ee 409 


Vil 


About the Authors 


Sridhar Alla is the co-founder and CTO of Bluewhale, 
which helps organizations big and small in building 
AlJ-driven big data solutions and analytics. He is a published 
author of books and an avid presenter at numerous Strata, 
Hadoop World, Spark Summit, and other conferences. He 
also has several patents filed with the US PTO on large-scale 
computing and distributed systems. He has extensive 





hands-on experience in several technologies including 
Spark, Flink, Hadoop, AWS, Azure, Tensorflow, Cassandra, 
and others. He spoke on anomaly detection using deep learning at Strata SFO in 
March 2019 and will also present at Strata London in October 2019. 

Sridhar was born in Hyderabad, India and now lives in New Jersey with his wife, 
Rosie, and daughter, Evelyn. When he is not busy writing code he loves to spend time 
with his family; he also loves training, coaching, and organizing meetups. 

He can be reached via email at sid@b luewhale.one or via LinkedIn at www. Linkedin. 
com/in/sridhar-a-1619b42/. Please visit www. bluewhale.one for more details on how 
he could help your organization. 


Suman Kalyan Adari is an undergraduate student pursuing 
a B.S. in Computer Science at the University of Florida. He 
has been conducting deep learning research in the field of 
cybersecurity since his freshman year, and has presented at 
the IEEE Dependable Systems and Networks workshop on 
Dependable and Secure Machine Learning held in Portland, 
Oregon, USA in June 2019. 


He is quite passionate about deep learning, and specializes in its practical uses in 





various fields such as video processing, image recognition, anomaly detection, targeted 
adversarial attacks, and more. 

He can be contacted via email at sumank. adari@yahoo.com or at sadari@ufl.edu. 
He also has a LinkedIn account at www. linkedin. com/in/suman-kalyan-adari/. 


About the Technical Reviewers 


Jojo Moolayil is an artificial intelligence professional and 
published author of three books on machine learning, deep 
learning, and IoT. He is currently working with Amazon 
Web Services as a Research Scientist - A.I. in their 
Vancouver, BC office. 

He was born and raised in Pune, India and graduated 
from the University of Pune with a major in Information 
Technology Engineering. His passion for problem solving 
and data-driven decision making led him to start a career 





| iy with Mu Sigma Inc., the world’s largest pure-play analytics 
provider, where he was responsible for developing machine learning and decision 
science solutions for large complex problems for healthcare and telecom giants. He 
later worked with Flutura (an IoT Analytics startup) and General Electric with a focus on 
industrial A.I in Bangalore, India. 

In his current role with AWS, he works on researching and developing large scale 
A.I. solutions for combating fraud and enriching the customer’s payment experience in 
the cloud. He is also actively involved as a tech reviewer and AI consultant with leading 
publishers and has reviewed over a dozen books on machine learning, deep learning, 
and business analytics. 

You can reach Jojo at 


e www. jojomoolayil.com/ 
e www. linkedin.com/in/j0j062000 


e https://twitter.com/j0j062000 


ABOUT THE TECHNICAL REVIEWERS 





Satyajit Pattnaik is a Senior Data Scientist with around eight 
years of expertise in the field. He has a passion for turning 
data into actionable insights and meaningful stories. Right 
from the data extraction until the final data product or 
actionable insights, he enjoys the journey with the data. 

He is a dedicated and determined person who can 
adapt to any environment, which is quite evident from the 
cross-domain projects involving different types of data, 
platforms, and techniques he has worked on. Apart from 
the skills related to data capture, analysis, and presentation, 
he possesses good problem solving skills. Being from the 


computer science field is really an add-on to do things quickly and in a reusable manner. 


Along with machine learning, he believes in quick learning as and when needed. 


Acknowledgments 


Sridhar Alla 

I would like to thank my wonderful, loving wife, Rosie Sarkaria, and my beautiful, 
loving daughter, Evelyn, for all their love and patience during the many months I spent 
writing this book. I would also like to thank my parents, Ravi and Lakshmi Alla, for their 


blessings and all the support and encouragement they continue to bestow upon me. 


Suman Kalyan Adari 

I would like to thank my parents, Krishna and Jyothi, and my loving dog, Pinky, for 
supporting me throughout the entire process of writing my first book. I would especially 
like to thank my sister, Niha, for helping me with graph creation, proof-reading, editing, 
and testing the code samples. 


xii 


Introduction 


Congratulations on your decision to explore deep learning and the exciting world of 
anomaly detection using deep learning. 

Anomaly detection is finding patterns that do not adhere to what is considered as 
normal or expected behavior. Businesses could lose millions of dollars due to abnormal 
events. Consumers could also lose millions of dollars. In fact, there are many situations 
every day where people’s lives are at risk and where their property is at risk. If your bank 
account gets cleaned out, that is a problem. If your water line breaks, flooding your 
basement, that’s a problem. If all flights get delayed in the airport, causing long delays, 
that’s a problem. You might have been misdiagnosed or not diagnosed at all with a 
health issue, which is a very big problem directly impacting your well-being. 

In this book, you will learn how anomaly detection can be used to solve business 
problems. You will explore how anomaly detection techniques can be used to address 
practical use cases and address real-life problems in the business landscape. Every 
business and use case is different, so while we cannot copy-paste code and build a 
successful model to detect anomalies in any dataset, this book will cover many use cases 
with hands-on coding exercises to give an idea of the possibilities and concepts behind 
the thought process. 

We choose Python because it is truly the best language for data science with a 
plethora of packages and integrations with scikit-learn, deep learning libraries, etc. 

We will start by introducing anomaly detection and then we will look at legacy 
methods of detecting anomalies used for decades. Then we will look at deep learning to 
get a taste of it. 

Then we will explore autoencoders and variational autoencoders, which are paving 
the way for the next generation of generative models. 

We will explore RBM (Boltzmann machines) as way to detect anomalies. Then we'll 
look at LSTMs (long short-term memory) models to see how temporal data can be 
processed. 

We will cover TCN (Temporal Convolutional Networks), which are the best in 
class for temporal data anomaly detection. Finally, we will look at several examples of 


anomaly detection in various business use cases. 


INTRODUCTION 


In addition, we will also cover Keras and PyTorch, the two most popular deep 
learning frameworks in detail in the Appendix chapters. 

You will combine all this extensive knowledge with hands-on coding using Jupyter 
notebook-based exercises to experience the knowledge first hand and see where you can 
use these algorithms and frameworks. 

Best of luck and welcome to the world of deep learning! 


Xv1 


CHAPTER 1 


What Is Anomaly 
Detection? 


In this chapter, you will learn about anomalies in general, the categories of anomalies, 
and anomaly detection. You will also learn why anomaly detection is important and how 
anomalies can be detected and the use case for such a mechanism. 

In a nutshell, the following topics will be covered throughout this chapter: 


e What is an anomaly? 
e Categories of different anomalies 
e What is anomaly detection? 


e Where is anomaly detection used? 


What Is an Anomaly? 


Before you get started with learning about anomaly detection, you must first understand 
exactly what you are targeting. Generally, an anomaly is an outcome or value that 
deviates from what is expected, but the exact criteria for what determines an anomaly 


can vary from situation to situation. 


Anomalous Swans 


To get a better understanding of what an anomaly is, let’s take a look at some swans 
sitting by a lake (Figure 1-1). 


© Sridhar Alla, Suman Kalyan Adari 2019 
S. Alla and S. K. Adari, Beginning Anomaly Detection Using Python-Based Deep Learning, 
https://doi.org/10.1007/978-1-4842-5177-5_1 


CHAPTER 1 WHAT IS ANOMALY DETECTION? 





Figure 1-1. A couple of swans by a lake 


Say you want to observe these swans and make assumptions about the color of the 
swans. Your goal is to determine the normal color of swans and to see if there are any 
swans that are of a different color than this (Figure 1-2). 


CHAPTER 1 WHAT IS ANOMALY DETECTION? 





Figure 1-2. More swans show up, and they're all white swans 


More swans show up, and given that you haven't seen any swans that aren’t white, 
it seems reasonable to assume that all swans at this lake are white. Let’s just keep 
observing these swans, shall we? 


CHAPTER 1 WHAT IS ANOMALY DETECTION? 





Figure 1-3. A black swan appears 


What’s this? Now you see a black swan show up (Figure 1-3), but how can this be? 
Considering all of your previous observations, you've seen enough of the swans to 
assume that the next swan would also be white. However, the black swan you see defies 
that entirely, making it an anomaly. It’s not really an outlier where you could have a 
really big white swan or really small white swan, but it’s a swan that’s entirely a different 
color, making it the anomaly. In this scenario, the overwhelming majority of swans are 
white, making the black swan extremely rare. 

In other words, given a swan by the lake, the probability of it being black is very 
small. You can explain your reasoning for labeling the black swan as an anomaly with 
one of two approaches, though you aren’t just limited to these two approaches. 

First, given that a vast majority of swans observed at this particular lake are white, 
you can assume that, through a process similar to inductive reasoning, the normal color 
for a swan here is white. Naturally, you would label the black swan as an anomaly purely 
based on your prior assumption that all swans are white, considering that you've only 
seen white swans thus far. 

Another way to look at why the black swan is an anomaly is through probability. 
Assuming that there is a total of 1000 swans at this giant lake with only two black swans, 


4. 


CHAPTER 1 WHAT IS ANOMALY DETECTION? 


the probability of a swan being black is 2/1000, or 0.002. Depending on the probability 
threshold, meaning the lowest probability for an outcome or event that will be accepted 
as normal, the black swan could be labeled as anomalous or normal. In your case, you 


will consider it an anomaly because of its extreme rarity at this lake. 


Anomalies as Data Points 


Let’s extend this same concept to a real-world application. In the following example, 
you will take a look a factory that produces screws and attempt to determine what an 
anomaly could be in this context. The factory produces massive batches of screws all 
at once, and samples from each batch are tested to ensure that a certain level of quality 
is maintained. For each sample, assume that the density and tensile strength (how 
resistant the screw is to breaking under stress) is measured. 

Figure 1-4 is an example graph of various sample batches with the dotted lines 


representing the range of densities and tensile strengths allowed. 


Density 





Tensile Strength 


Figure 1-4. Density and tensile strength in sample batches of screws 


CHAPTER 1 WHAT IS ANOMALY DETECTION? 


The intersections of the dotted lines create several different regions containing data 
points. Of interest is the bounding box (solid lines) created from the intersection of both 
dotted lines since it contains the data points for samples deemed acceptable (Figure 1-5). 


Any data point outside of that specific box will be considered anomalous. 


Density 








Tensile Strength 


Figure 1-5. Data points are identified as good or anomaly based on their 
location 


Now that you know what points are and aren’t acceptable, let’s pick out a sample 
from a new batch of screws and check its data to see where it falls on the graph 


(Figure 1-6). 


CHAPTER 1 WHAT IS ANOMALY DETECTION? 


Density 





Tensile Strength 


Figure 1-6. A new data point representing the new sample screw is generated, 
with the data falling within the bounding box 


The data for this sample screw falls within the acceptable range. That means that this 
batch of screws is good to use since its density and tensile strength are appropriate for 


use by the consumer. Now let’s look at a sample from the next batch of screws and check 


its data (Figure 1-7). 


CHAPTER 1 =WHAT IS ANOMALY DETECTION? 


Density 





Tensile Strength 


Figure 1-7. A new data point is generated for another sample, but it falls outside 
the bounding box 


The data falls far outside the acceptable range. For its density, the screw has abysmal 
tensile strength and is unfit for use. Since it has been flagged as an anomaly, the factory 
can investigate the reasons for why this batch of screws turned out to be brittle. For a 
factory of considerable size, it is important to hold a high standard of quality as well 
as maintain a high volume of steady output to keep up with consumer demand. For a 
monumental task like that, automation to detect any anomalies to avoid sending out 
faulty screws is essential and has the benefit of being extremely scalable. 

So far, you have explored anomalies as data points that are either out of place, in the 
case of the black swan, or unwanted, in the case of faulty screws. So what happens when 


you introduce time as a new variable? 


CHAPTER 1 WHAT IS ANOMALY DETECTION? 


Anomalies in a Time Series 


With the introduction of time as a variable, you are now dealing with a notion of 
temporality associated with the data sets. What this means is that certain patterns 
can emerge based on the time stamp, so you can see monthly occurrences of some 
phenomenon. 

To better understand time-series based anomalies, let’s take a random person and 
look into his/her spending habits over some arbitrary month (Figure 1-8). 





: 
2 
O 
O 
O 
O 
: 0 ° ‘ 


OOO 
OO 
@00 O 








———— a . > 
: _ | es; me 
Week 1 Week 2 Week 3 Week 4 


Figure 1-8. Spending habits of a person over the course of a month 


Assume the initial spike in expenditures at the start of the month is due to the 
payment of bills like rent and insurance. During the weekdays, our person occasionally 
eats out, and on the weekends goes shopping for groceries, clothes, or just various items. 

These expenditures can vary from month to month from the influence of various 
holidays. Let’s take a look at November, when you can expect a massive spike in 
purchases on Black Friday (Figure 1-9). 


CHAPTER 1 WHAT IS ANOMALY DETECTION? 











O 
= fs O 
2 § 3 
O 
O 
O 
2) 0°O 
°° 
O 
3 , ' 
O O O O 
O oO O 
————— = —— = ie) eee. 
Week 1 Week 2 Week 3 Week 4 


Figure 1-9. Spending habits for the same person during the month of November 


As expected, there are a lot of purchases made on Black Friday, some of them quite 
expensive. However, this spike is expected since it is a common trend for many people. 
Now assume that unfortunately, your person had his/her credit card information stolen, 
and the criminals responsible for it have decided to purchase various items of interest to 
them. Using the same month as in the first example (Figure 1-8), Figure 1-10 is a possible 
graph showcasing what could happen. 


10 


CHAPTER 1 WHAT IS ANOMALY DETECTION? 














Oo 
g 8 bd 
S O ° 
O 
oO 
O 
O 
8 O O O O 
O 
: g 8 
O O 
ae 7 —————————; —-* 
Week 1 Week 2 Week 3 Week 4 


Figure 1-10. Graph of purchases for the person during the same month as in 
Figure 1-8 


Because of the record of purchases for the user from a previous year, the sudden 
influx in purchases would be flagged as anomalies given the context. Such a cluster of 
purchases might be normal for Black Friday or before Christmas, but in any other month 
without a major holiday it might look out of place. In this case, your person might be 
contacted by the corresponding officials to confirm if they made the purchase or not. 

Some companies might even flag purchases that follow normal societal trends. What 
if that TV wasn’t really bought by your person on Black Friday? In that case, company 
software can ask the client directly through a phone app, for example, whether or not 
he/she actually bought the item in question, allowing for some additional protection 
against fraudulent purchases. 


Taxi Cabs 


Similarly, you can look at the data for taxi cab pickups and drop-offs over time for a 
random city and see if you can detect any anomalies. On an average day, the total 
number of pickups can look somewhat like Figure 1-11. 


11 


CHAPTER 1 WHAT IS ANOMALY DETECTION? 





& 
5 009 e 
E O 
= ° - O 
O © O e) 
Oo © o Oo 
O 
ad O 
O 
oO 
Oo 
‘@) 
O 
ee _ se --.-—<~S«7«<SThmhmh””hC:éi<(i‘“CrS!C!;~;~;~ 
12:00 AM 4:00 AM 8:00 AM 12:00 PM 4:00 PM 8:00 PM 12:00 AM 


Figure 1-11. Graph of the number of pickups for a taxi company throughout 
the day 


From the graph, you see that there’s a bit of post-midnight activity that drops off to 
near nothing during the late-night hours. However, it picks up suddenly around morning 
rush hour and remains high until the evening, when it peaks during evening rush hour. 
This is essentially what an average day looks like. 

Let’s expand the scope out a bit more to gain some perspective of passenger traffic 
throughout the week; see Figure 1-12. 


12 


CHAPTER 1 WHAT IS ANOMALY DETECTION? 


Number of Pickups 


O ° O 
o fe) 
O 
© 
l | 
Monday Tuesday Wednesday Thursday Friday Saturday Sunday 


Figure 1-12. Graph of the number of pickups for a taxi company throughout 
the week 


As expected, most of the pickups occur during the weekday when commuters 
must get to and from work. On the weekends, a fair amount of people still go out to get 
eroceries or just go out somewhere for the weekend. 

On a small scale like this, causes for anomalies are anything that prevents taxis from 
operating or incentivizes customers not to use a taxi. For example, say that a terrible 


thunderstorm hits on Friday. Figure 1-13 shows that graph. 


13 


CHAPTER 1 ~=WHAT IS ANOMALY DETECTION? 


Number of Pickups 


l tT ——=—$S TT TT 
Monday Tuesday Wednesday Thursday Friday Saturday Sunday 





Figure 1-13. Graph of the number of pickups for a taxi company throughout the 
week, with a heavy thunderstorm on Friday 


The presence of the thunderstorm could have influenced some people to stay 
indoors, resulting in a lower number of pickups than usual for a weekday. However, 
these sorts of anomalies are usually too small scale and to have any noticeable effect on 
the overall pattern. 

Let’s take a look at the data over the entire year; see Figure 1-14. 


Number of Pickups 





January February March April May june July August September October November December 


Figure 1-14. Number of pickups for a taxi company throughout the year 
14 


CHAPTER 1 WHAT IS ANOMALY DETECTION? 


The dips occur around the winter months when snowstorms are expected. Sure 
enough, these are regular patterns that can be observed at similar times every year, 
so they are not an anomaly. But what happens when a polar vortex descends sometime 
in April? 


Number of Pickups 





| 
January February March April May june July August September October November December 


Figure 1-15. Number of pickups for a taxi company throughout the year, with a 
polar vortex hitting the city in April 


As you can see in Figure 1-15, the vortex unleashes several intense blizzards on the 
imaginary city, severely slowing down all traffic in the first week and burdening the city 
in the following two weeks. Comparing this graph from the one above, there’s a clearly 
defined anomaly in the graph caused by the polar vortex for the month of April. Since 


this pattern is extremely rare for the month of April, it would be flagged as an anomaly. 


Categories of Anomalies 


Now that you have some perspective of what anomalies can be in various situations, you 
can see that they generally fall into these broad categories: 


e Data point-based anomalies 
e Context-based anomalies 
e Pattern-based anomalies 


15 


CHAPTER 1 =WHAT IS ANOMALY DETECTION? 


Data Point-Based Anomalies 


Data point-based anomalies can seem comparable to outliers in a set of data points. 
However, anomalies and outliers are not the same thing. Outliers are data points that are 
expected to be present in the data set and can be caused by unavoidable random errors 
or from systematic errors relating to how the data was sampled. Anomalies are outliers 
or other values that one doesn’t expect to exist. These types of anomalies can be found 
wherever a data set of values exists. 

An example of this is a data set of thyroid diagnostic values, where the majority of 
the data points are indicative of normal thyroid functionality. In this case, anomalous 
values represent sick thyroids. While they are not necessarily outliers, they have a low 
probability of existing when taking into account all the normal data. 

You can also detect individual purchases totaling to excessive amounts and label 
them as anomalies since, by definition, they are not expected to occur or have a very low 
probability of occurrence. In this case, they are labeled as fraud transactions, and the 
card holder is contacted to ensure the validity of the purchase. 

Basically, you can say this about the difference between anomalies and outliers: you 
should expect there to be outliers in a set of data, but not anomalies. 


Context-Based Anomalies 


Context-based anomalies consist of data points that might seem normal at first, but 
are considered anomalies in their respective contexts. For example, you might expect 
a sudden surge in purchases near certain holidays, but these purchases could seem 
out of place in the middle of August. As you saw in the example earlier, the person who 
made a high volume of purchases towards Black Friday was not flagged because it is 
typical for people to do so around that time. However, if the purchases were made in a 
month where it is out of place given previous purchase history, it would be flagged as 
an anomaly. This might seem similar to the example brought up for data point-based 
anomalies; the distinction here is that the individual purchase does not have to be 
expensive. If your person never buys gasoline because he/she owns an electric car, 
sudden purchases of gasoline would be out of place given the context. Buying gasoline is 
quite a normal thing to do for everyone, but in this context, it is an anomaly. 


16 


CHAPTER 1 WHAT IS ANOMALY DETECTION? 


Pattern-Based Anomalies 


Pattern-based anomalies are patterns and trends that deviate from their historical 
counterparts. In the taxi cab example, the pickup counts for the month of April were 
pretty consistent with the rest of the year. However, once the polar vortex hit, the numbers 
tanked visibly, defining a huge drop in the graph that was labeled as an anomaly. 

Similarly, when monitoring network traffic in the workplace, there are expected 
patterns of network traffic that are formed from constant monitoring of data over several 
months or even years for some companies. When an employee attempts to download 
or upload large volumes of data, it will generate a certain pattern in the overall network 
traffic flow that could be considered anomalous if it deviates from the employee’s usual 
behavior. 

If an external hacker decided to DDOS the company’s website (DDOS, or a 
distributed denial-of-service attack, is an attempt to overwhelm the server that handles 
network flow to a certain website in an attempt to bring the entire website down or 
stop its functionality), every single attempt would register as an unusual spike in 
network traffic. All of these spikes are clearly deviants from normal traffic and would be 


considered anomalous. 


Anomaly Detection 


With a better understanding of the different types of anomalies you can encounter, you 
can now proceed to start creating models to detect them. Before you do that, there are a 
couple approaches you can take, although you are not limited to just these methods. 

Recall the reasoning for labeling the swan as an anomaly. One of the reasons was 
that since all the swans you saw thus far were white, the black swan was the anomaly. 
Another reason was that since the probability of a swan being black was very low, it was 
an anomaly since you didn’t expect that outcome. 

The anomaly detection models you will explore in this book will follow these 
approaches by either training on normal data to classify anomalies, or classifying 
anomalies by their probabilities if they are below a certain threshold. However, in one 
of the classes of models that you choose, the anomalies and normal data points will 
both labeled as such, so you will basically be told what swans are normal and what 


swans are anomalies. 


17 


CHAPTER 1 WHAT IS ANOMALY DETECTION? 


Finally, let’s explore anomaly detection. Anomaly detection is the process in 
which an advanced algorithm identifies certain data or data patterns to be anomalous. 
Heavily related to anomaly detection are the tasks of outlier detection, noise removal, 
and novelty detection. In this book, you will explore all of these options as they are all 


basically anomaly detection methods. 


Outlier Detection 


Outlier detection is a technique that aims to detect anomalous outliers within a given 
data set. As discussed, three methods that can be applied to this situation are to train 
only on normal data to identify anomalies by a high reconstruction error, to model a 
probability distribution in which anomalies are labeled based on their association with 
really low probabilities, or to train a model to recognize anomalies by teaching it what an 
anomaly looks like and what a normal point looks like. 

Regarding the high reconstruction error, think of the model as having trouble 
labeling an anomaly because it is odd compared to all the normal data points that it has 
seen. Just like how the black swan is really different based on your initial assumption that 
all swans are white, the model perceives this anomalous data point as “different” and has 
a harder time interpreting it. 


Noise Removal 


In noise removal, there is constant background noise in the data set that must be filtered 
out. Imagine that you are at a party and you are talking to your friend. There is a lot of 
background noise, but your brain focuses on your friend’s voice and isolates it because 
that’s what you want to hear. Similarly, the model learns an efficient way to represent the 
original data so that it can reconstruct it without the anomalous interference noise. 

This can also be a case where an image has been altered in some form, such as by 
having perturbations, loss of detail, fog, etc. The model learns an accurate representation 
of the original image and outputs a reconstruction without any of the anomalous 


elements in the image. 


Novelty Detection 


Novelty detection is very similar to outlier detection. In this case, a novelty is a data 
point outside of the training set, the data set the model was exposed to, that was shown 


18 


CHAPTER 1 WHAT IS ANOMALY DETECTION? 


to the model to determine if it is an anomaly or not. The key difference between novelty 
detection and outlier detection is that in outlier detection, the job of the model is to 
determine what is an anomaly within the training data set. In novelty detection, the 
model learns what is a normal data point and what isn’t, and tries to classify anomalies 
in a new data set that it has never seen before. 


The Three Styles of Anomaly Detection 


It is important to note that there are three overarching “styles” of anomaly detection. 
They are 


e Supervised anomaly detection 
e Semi-supervised anomaly detection 
e Unsupervised anomaly detection 


Supervised anomaly detection is a technique in which the training data has labels 
for both anomalies and for normal data points. Basically, you tell the model during the 
training process if a data point is an anomaly or not. Unfortunately, this isn’t the most 
practical method of training, especially because the entire data set needs to be processed 
and each data point needs to be labeled. Since supervised anomaly detection is basically 
a type of binary classification task, meaning the job of the model is to categorize data 
under one of two labels, any classification model can be used for the task, though not 
every model can attain a high level of performance. An example of this can be seen in 
Chapter 7 with the temporal convolutional network. 

Semi-supervised anomaly detection involves partially labeling the training data 
set. In the context of anomaly detection, this can be a case where only the normal data 
is labeled. Ideally, the model will learn what normal data points look like, so that the 
model can flag anomalous data points as anomalies since they differ from normal data 
points. Examples of models that can use semi-supervised learning for anomaly detection 
include autoencoders, which you will learn about in Chapter 4. 

Unsupervised anomaly detection, as the name implies, involves training the model 
on unlabeled data. After the training process, the model is expected to know what 
data points are normal and what points are anomalous within the data set. Isolation 
forest, a model you will explore in Chapter 2, is one such model that can be used for 
unsupervised anomaly detection. 


19 


CHAPTER 1 =WHAT IS ANOMALY DETECTION? 


Where Is Anomaly Detection Used? 


Whether we realize it or not, anomaly detection is being utilized in nearly every facet 
of our lives today. Pretty much any task involving data collection of any sort could have 
anomaly detection applied to it. Let’s look at some of the most prevalent fields and topics 


that anomaly detection can be applied in. 


Data Breaches 


In today’s age of big data, where huge volumes of information are stored about users 

in various companies, information security is vital. Any information breaches must 

be reported and flagged immediately, but it is hard to do so manually at such a scale. 
Data leaks can range from simple accidents such as losing a USB stick that contains a 
company’s sensitive information to employees intentionally sending data to an outside 
party to intrusion attacks that attempt to gain access to the database. You must have 
heard of some high-profile data leaks, such as the Facebook security breach, the iCloud 
data breach, and the Google security breach where millions of passwords were leaked. 
All of those companies operate on an international scale, requiring automation to 
monitor everything in order to ensure the fastest response time to any breach. 

The data breaches might not even need network access. For example, an employee 
could email an outside party or another employee with connections to rival companies 
about travel plans to meet up and exchange confidential information. Anomaly 
detection models can sift through and process employee emails to flag any suspicious 
employees. The software can pick up key words and process them to understand the 
context and decide whether or not to flag an employee’s email for review. 

When employees try to upload data to another connection, the anomaly detection 
software can pick up on the unusual flow of data while monitoring network traffic and 
flag the employee. An important part of an employee’s regular work day would be to 
pull and push to a code repository, so one might expect regular spikes in data transfer in 
these cases. However, the software takes into account lots of variables, including who the 
sender is, who the recipient is, how the data is being sent (in erratic intervals, all at once, 
or spread out over time). In either case, something won’t add up, which the software will 
pick up and then it will flag the employee. 

The key benefit to using anomaly detection in the workspace is how easy it is to scale 
up. These models can be used for small companies as well as large-scale international 


companies. 


20 


CHAPTER 1 = WHAT IS ANOMALY DETECTION? 


Identity Theft 


Identity theft is another common problem in today’s society. Thanks to the development 
of online services allowing for ease of access when purchasing items, the volume of 
credit card transactions that take place every day has grown immensely. However, 

this development also makes it easier to steal credit card information or bank account 
information, allowing the criminals to purchase anything they want if the card isn’t 
deactivated or if the account isn’t secured again. Because of the huge volume of 
transactions, it can get hard to monitor everything. However, this is where anomaly 
detection can step in and help, since it is highly scalable and can help detect fraud 
transactions the moment the request is sent. 

As you Saw earlier, context matters. If a transaction is made, the software will take 
into account the card holder’s previous history to determine if it should be flagged or not. 
Obviously, a high value purchase made suddenly would raise alarms immediately, but 
what if the criminals were smart enough to realize that and just make a series of purchases 
over time that won’t put a noticeable hole in the card holder’s account? Again, depending 
on the context, the software would pick up on these transactions and flag them again. 

For example, let’s say that someone’s grandmother was recently introduced to 
Amazon and to the concept of buying things online. One day, unfortunately, she 
stumbles upon an Amazon lookalike and enters her credit card information. On the 
other side, some criminal takes it and starts buying random things, but not all at once 
so as not to raise suspicion-or so he thought. The identify theft insurance company 
starts noticing some recent purchases of batteries, hard drives, flash drives, and other 
electronic items. While these purchases might not be that expensive, they certainly 
stand out when all the purchases made by the grandmother up until now consisted 
of groceries, pet food, and various decoration items. Based on this previous history, 
the detection software would flag the new purchases and the grandmother would be 
contacted to verify these purchases. These transactions can even be flagged as soon 
as an attempt to purchase is made. In this case, either the location or the transactions 
themselves would raise alarms and stop the transaction from being successful. 


Manufacturing 


You explored a use case of anomaly detection in manufacturing. Manufacturing plants 
usually have a certain level of quality that they must ensure that their products meet 
before shipping them out. When factories are configured to produce massive quantities 


21 


CHAPTER 1 =WHAT IS ANOMALY DETECTION? 


of output at a near constant rate, it becomes necessary to automate the process of 
checking the quality of various samples. Similar to the screw example, manufacturing 
plants in real life might test to uphold the quality of various metal parts, tools, engines, 
food, clothes, etc. 


Networking 


Perhaps one of the most important use cases that anomaly detection has is in 
networking. The internet is host to a vast array of various websites that are located 

all around the world. Unfortunately, due to the ease of access to the Internet, various 
individuals can access the Internet with nefarious purposes. Similar to the data leaks that 
were discussed earlier in the context of protecting company data, hackers can launch 
attacks on other websites as well to leak their information. 

One such example is hackers attempting to leak government secrets through a 
network attack. With such sensitive information as well as the high volumes of expected 
attacks every day, automation is a necessary tool to help cybersecurity professionals deal 
with the attacks and preserve state secrets. On a smaller scale, hackers might attempt to 
breach individual cloud networks or a local area network and try to leak data. Even in 
smaller cases like this, anomaly detection can help detect network intrusion attacks as 
they happen and notify the proper officials. An example data set for network intrusion 
anomaly detection is the KDD Cup 1999 data set. This data set contains a large amount 
of entries that detail various types of network intrusion attacks as well as a detailed list of 
variables for each attack that can help a model identify each type of attack. 


Medicine 


Moving on from networking, anomaly detection has a massive role to play in the field of 
medicine. For example, models can detect subtle irregularities in a patient’s heartbeat 
in order to classify diseases, or they can measure brainwave activity to help doctors 
diagnose certain conditions. Beyond that, they can help analyze raw diagnostic data fora 
patient’s organ and process it in order to quickly diagnose any possible problems within 
the patient, similarly to the thyroid example discussed earlier. 

Anomaly detection can even be used in medical imagery to determine if a given 
image contains anomalous objects or not. For example, if a model was only exposed to 
MRI imagery of normal bones and was shown an image of a broken bone, it would flag 


22 


CHAPTER 1 WHAT IS ANOMALY DETECTION? 


the new image as an anomaly. Similarly, anomaly detection can even be extended to 
tumor detection, allowing for the model to analyze every image in a full body MRI scan 
and look for the presence of abnormal growth or patterns. 


Video Surveillance 


Anomaly detection also has uses in video surveillance, where anomaly detection 
software can monitor video feeds and help flag any videos that capture anomalous 
action. While this might seem dystopian, it can certainly help catch criminals or 
maintain public safety on busy streets or in cities. For example, this software could 
identify a mugging in a street at night as an anomalous event and alert authorities who 
can call in police officers. Additionally, it can detect unusual events at crossroads such as 
an accident or some unusual obstruction and immediately call attention to the footage. 


Summary 


Generally, anomaly detection is utilized heavily in medicine, finance, cybersecurity, 
banking, networking, transportation, and manufacturing, but it is not just limited 
to those fields. For nearly every case imaginable involving data collection, anomaly 
detection can be put to use to help users automate the process of detecting anomalies 
and possibly removing them. Many fields in science can utilize anomaly detection 
because of the large volume of raw data collection that goes on. Anomalies that would 
interfere with the interpretation of results or otherwise introduce some sort of bias into 
the data could be detected and removed, provided that the anomalies are caused by 
systematic or random errors. 

In this chapter, we discussed what anomalies are and why detecting anomalies can 
be very important to the data processing we have at our organizations. 

In the next chapter, we will look at traditional statistical and machine learning 
algorithms for anomaly detection. 


23 


CHAPTER 2 


Traditional Methods 
of Anomaly Detection 


In this chapter, you will learn about traditional methods of anomaly detection. You 
will also learn how various statistical methods and machine learning algorithms work 
and how they can be used to detect anomalies and how you can implement anomaly 
detection using several algorithms. 

In a nutshell, the following topics will be covered throughout this chapter: 


e A data science review 
e The three styles of anomaly detection 
e The isolation forest 


e One-class support vector machine (OC-SVM) 


Data Science Review 


It is important to understand some basic data science concepts in order for you to 
evaluate how well your model performs and to compare its performance with other 
models. 

First of all, the goal in anomaly detection is to determine whether or not a given 
point is an anomaly or not. Essentially, you are labeling a data point x with a class y. 
Assume that in some context, you are trying to classify whether or not an animal tests 
positive (meaning yes) for some disease. If the animal is diseased and it tests positive, 
this case is a true positive. If the animal is healthy and the test shows negative (meaning 
it doesn’t have the disease), then it’s called a true negative. However, there are cases 


20 
© Sridhar Alla, Suman Kalyan Adari 2019 


S. Alla and S. K. Adari, Beginning Anomaly Detection Using Python-Based Deep Learning, 
https://doi.org/10.1007/978-1-4842-5177-5_2 


CHAPTER 2. TRADITIONAL METHODS OF ANOMALY DETECTION 


where the test can fail. If the animal is healthy but the test says positive, this case is a 
false positive. If the animal is diseased but the test shows negative, this case is a false 
negative. 

In statistics, there are similar terms to false positive and false negative: type I 
error and type II error. These errors are used in hypothesis testing where you have 
a null hypothesis (which usually says that there is no relation between two observed 
phenomena), and an alternate hypothesis (which aims to disprove the null hypothesis, 
meaning there is a statistically significant relation between the two observations). 

A type I error is when the null hypothesis turns out to be true, but you reject it 
anyways in favor of the alternate hypothesis. In other words, a false positive, since you 
reject what turns out to be true to accept something that is false. A type II error is when 
the null hypothesis is accepted to be true (meaning you don’t reject the null hypothesis), 
but it turns out the null hypothesis is false, and that the alternate hypothesis is true. This 
is a false negative, since you accept what is false, but reject what is true. 

For the context of the following definitions, assume that the condition is what you're 
trying to prove. It could be something as simple as “this is animal sick.’ The condition 
of the animal is either sick or healthy, and you're trying to predict if it is sick or healthy. 


Here are some definitions: 


e True positive: When the condition is true, and the prediction is 


also true 


e True negative: When the condition is false, and the prediction is 


also false 
e False positive: When the condition is false, but the prediction is true 
e False negative: When the condition is true, but the prediction is false 


Putting them together, you can form what is called a confusion matrix (Figure 2-1). 
One thing to note is that in the case of anomaly detection, you only need a 2x2 confusion 


matrix since data points are either anomalies or they are normal data. 


26 


CHAPTER 2. TRADITIONAL METHODS OF ANOMALY DETECTION 


Actual 


re 


Prediction 


Predicted True Positive | False Positive 
True (Type | error) 


4 (Positive) 


Predicted False True Negative 
False Negative 
(Negative) (Type II error) 





Figure 2-1. Confusion matrix 
From the values in each of the four squares, you can derive values for accuracy, 


precision, and recall to gain a better understanding of how your model performs. 


Here’s the confusion matrix with all the formulas (Figure 2-2): 


2d 


CHAPTER 2. TRADITIONAL METHODS OF ANOMALY DETECTION 





Actual 
= 
2 
iS 
2) 
v 
ou 
Positive T _ TP 
Precision = ——— 
TP+FP 
Negative 
TP+T0N 
Accuracy = (rP + TN) 
Total 





Recall = ——_—__—— 
(iP + FN) 


Figure 2-2. Precision, Accuracy and Recall 


e Precision is a measurement that describes how many of your true 
predictions actually turned out to be true. In other words, for all of 


your true predictions, how many did the model get right? 


e Accuracy is a measurement that describes how many predictions 
you got right over the entire data set. In other words, for the entire 
data set, how many did the model correctly predict were positive 


and negative? 


e Recall is a measurement that describes how many you predicted true 
for all data points that were actually true. In other words, for all of 
the true data points in the data set, how many of them did the model 


predict correctly? 


28 


CHAPTER 2. TRADITIONAL METHODS OF ANOMALY DETECTION 


From here, you can derive more values. 

Fl Score is the harmonic mean of precision and recall. It’s a metric that can tell us 
how accurate the model is, since it takes into account both how well the model makes 
true predictions that are actually true, and how many of the total true predictions that 
the model correctly predicted. 


2* Precision * R 
F1 Score = recision * Recall 


Precision + Recall 


The true positive rate (TPR) = recall = sensitivity. The same as recall, the TPR 
tells us how many of the data points that are actually true were predicted as true by 
the model. 


FP 
The false positive rate (FPR )=( 1-specificity )= ————— 
. ( [eee ty) FP+TN 
The FPR tells us how many of the data points that are actually false were predicted to 
be positive by the model. The formula is similar to recall, but instead of the proportion 
of true positives to all of the true data points, it’s the proportion of false positives to all of 
the false data points. 


ae TN 

Specificity =1-FPR= ee 

Specificity is very similar to recall in that it tells us how many of the data points that 
are actually false were predicted as false by the model. 

We can use the TPR and the FPR to form a graph known as a receiver operating 
characteristic curve, or ROC curve. From the area under the curve, or AUC (you 
may see this called area under the curve of the receiver operating characteristic, or 
AUROC), a data point, meaning the probability of the model to have a true positive or 
true negative case. This curve can also be called an AUCROC curve. 

ROC curve with AUC = 1.0 (Figure 2-3). 


20 


CHAPTER 2. TRADITIONAL METHODS OF ANOMALY DETECTION 


AUC = 1.0 


True Positive Rate 





False Positive Rate 
Figure 2-3. ROC curve with AUC = 1.0 


This is the most ideal AUC curve. However, it is nearly impossible to attain, so a goal 
of AUC > 0.95 is most desirable. The closer we can get the model to attaining a value of 
1.0 for the AUC, the more the probability of the model to predict a true positive or true 
negative case. The AUC value in the graph above indicates that this probability is 1.0, 
meaning it predicts it correctly 100% of the time. However, an extremely high AUC value 
of say 0.99999 could indicate that the model is overfitting, meaning it’s getting really 
good at predicting labels for this particular data set. You will explore this concept a bit 
further in the context of support vector machines, but you want to avoid overfitting as 
much as possible so that the model can perform well even when introduced to new data 
that includes unexpected variations. 

It is important to mention that although the AUC can be 0.99, for example, it is not 
guaranteed that the model will continue to perform at that high of a level outside of 
the training data set (the data used to train the model so that it can learn to classify 
anomalies and normal data). This is because in the real world, there is the factor of 
unpredictability that even has humans confused at times. The world would be a simpler 
place if data is black and white, so to speak, but more often than not, there is a huge 


30 


CHAPTER 2. TRADITIONAL METHODS OF ANOMALY DETECTION 


gray area (are we sure that point is X and not Y? Is this really an anomaly or just a really 
weird case of a normal point?). For deep learning models, it is important that they keep 
achieving high AUC scores when exposed to new data that includes plenty of variation. 
Basically, it’s a reasonable assumption to expect a slight drop in performance when 
exposing your model to new data outside of your training set. 

The goal with training models is to avoid overfitting and to keep the AUC as high as 
possible. If the AUC turns out to be 0.99999 even after being exposed to an extremely 
large sample of new data that includes a lot of variety, that means the model is basically 
about as ideal of a model we can get and has far surpassed human performance, which 
is impossible for the time being. 

ROC curve with AUC = 0.75 (Figure 2-4) 


AUC =0.75 


True Positive Rate 


False Positive Rate 
Figure 2-4. ROC curve with AUC = 0.75 


31 


CHAPTER 2. TRADITIONAL METHODS OF ANOMALY DETECTION 


The value for the AUC indicates that the model correctly predicts labels for data points 
only 75% of the time. It’s not bad, but it’s not good, so there’s clearly room to improve. 
ROC curve with AUC = 0.5 (Figure 2-5) 


AUC =0.5 


True Positive Rate 


False Positive Rate 
Figure 2-5. ROC curve with AUC = 0.5 


The value for the AUC indicates that the model only has a 50% chance, or a 
probability of 0.5, to predict the correct label. This is about the worst AUC value 
you can get, since it means the model cannot distinguish between the positive and 
negative classes. 

ROC curve with AUC = 0.25 (Figure 2-6) 


32 


CHAPTER 2. TRADITIONAL METHODS OF ANOMALY DETECTION 


AUC =0.25 


True Positive Rate 


False Positive Rate 
Figure 2-6. ROC curve with AUC = 0.25 


In this case, the model only has a probability of 0.25 to predict the right label, but this 
just means that it has a 0.75 probability of predicting the incorrect label. In the case that 
the AUC is 0, this means that the model is perfect at predicting the wrong label, meaning 
the labels are switched. If the AUC is < 0.5, this means the model gets better at predicting 
incorrectly as the AUC approaches 0.0. It’s the perfectly opposite case of when the AUC is 
> 0.5, where the model gets better at predicting correctly as the AUC approaches 1.0. 

In any case, you want the AUC to be > 0.5, and at least greater than 0.9 and ideally 
greater than 0.95. 


33 


CHAPTER 2. TRADITIONAL METHODS OF ANOMALY DETECTION 


Isolation Forest 


An isolation forest is a collection of individual tree structures that recursively partition 

the data set. In each iteration of the process, a random feature is selected, and the data 

is split based on a randomly chosen value between the minimum and maximum of 

the chosen feature. This is repeated until the entire data set is partitioned to form an 

individual tree in the forest. Anomalies generally form much shorter paths from the 

root than normal data points since they are much more easily isolated. You can find the 

anomaly score by using a function of the data point involving the average path length. 
Applying an isolation forest to an unlabeled data set in order to catch anomalies is an 

example of unsupervised anomaly detection. 


Mutant Fish 


To better understand what an isolation forest does, let’s look at an imaginary scenario. 
At a particularly large lake, an irresponsible fish breeder has released a mutant species 
of fish that looks eerily similar to the native species, but are on average bigger than the 
native species. Additionally, the proportion of the length of its tail fin to the length of its 
body is larger than the native species. All in all, there are three features you can use to 
distinguish the invasive, mutant species from the native species. 

Here’s a visual example detailing the differences of an average specimen of both 
species. You can see the native species in Figure 2-7. 


—— 


Co 
- eg ~< 


Figure 2-7. This is an example of the native species at this lake 





34 


CHAPTER 2. TRADITIONAL METHODS OF ANOMALY DETECTION 


You can see the invasive species in Figure 2-8. 





Figure 2-8. This is an example of the new, mutant species that has been released 
into the lake 


The invasive species is larger, has a bigger circumference, and has a longer tailfin 
on average (compare Figure 2-7 to Figure 2-8). However, the problem is that while the 
average specimen of each species has some noticeable distinctions between them, there 
is plenty of overlap between the two species where some of the native species grow large, 
some of the mutant species are just smaller, both have varying tail fin sizes, etc. so the 
differences might not always be as clear-cut. 

To find out the extent of this infiltration, a large group of fishermen have been 
assembled and presented with the task of identifying the species of each fish in a catch 
of 1,000 fish. In this case, assume that each fisherman will randomly profile each fish to 
determine whether it is a member of the native species or not. 

Now onto the evaluations. Each fisherman first picks a random feature to judge 
the samples on: the length of the fish, the circumference of the fish, or the proportion 
of its tail fin to its overall length. Then, the fisherman picks a random value between 
the known minimum and maximum values of the corresponding measurement for the 
native species and splits all the fish accordingly (all fish with the relevant measurement 
equal to or bigger than the picked value go right, and everything else goes left, for 
example). The fisherman repeats the entire process over and over again until every 
single fish has been partitioned and a “tree” of fish has been created. 


35 


CHAPTER 2. TRADITIONAL METHODS OF ANOMALY DETECTION 


In this case, each individual fisherman represents a tree in the isolation forest, and 
the resulting trees of the entire group of fishermen represent an isolation forest. Now, 
given a random fish in the entire catch, you can get an anomaly score to see how many 
of the fisherman found that this fish is anomalous. Based on the threshold you pick for 
the anomaly score, you can label certain fish as the invasive species and the others as the 
native species. 

However, the problem is that this is not a perfect system; there will be some invasive 
fish that pass off as native fish, and some native fish that pass off as invasive species. 
These cases represent false positives and false negatives. 


Anomaly Detection with Isolation Forest 


Now that you understand more about how an isolation forest works, you can move on to 
applying it to a data set. Before you start, it is important to note that an isolation forest 
performs well on high-dimensional data. For the invasive fish example, you had three 
features to work with: fish length, circumference, and proportion of tail fin length to 
overall length. In this next example, you will have 42 features per data entry. 

You will use the KDDCUP 1999 data set, which contains an extensive amount of 
data representing a wide variety of intrusion attacks. In particular, you will focus on all 
data entries that involve an HTTP attack. The data set can be found at http://kdd.ics. 
uci.edu/databases/kddcup99/kddcup99.html. After opening the link, you should see 
something like Figure 2-9. 


36 


CHAPTER 2. TRADITIONAL METHODS OF ANOMALY DETECTION 


KDD Cup 1999 Data 


Abstract 


This is the data set used for The Third International Knowledge Discovery and Data Mining Tools 
detector, a predictive model capable of distinguishing between “bad” connections, called intrusion: 


Information files: 
e task description. This is the original task decription given to competition participants. 
Data files: 


e kddcup.names A list of features. 

e kddcup.data.gz The full data set (18M; 743M Uncompressed) 

e kddcup.data 10 percent.gz A 10% subset. (2.1M; 75M Uncompressed) 

e kddcup.newtestdata 10 percent unlabeled.gz (1.4M; 45M Uncompressed) 
e kddcup testdata_unlabeled.gz (11.2M; 430M Uncompressed) 

e kddcup.testdata unlabeled 10 percent.gz (1.4M:45M Uncompressed) 

® corrected.gz Test data with corrected labels. 


e training attack types A list of intrusion types. 
e typo-correction txt A brief note on a typo in the data set that has been corrected (6/26/07) 


The UCI KDD Archive 


Information and Computer Science 


University of California, Irvine 
Irvine, CA 92697-3425 
Last modified: October 28, 1999 


Figure 2-9. This is what you should see when you open the link 


Download the kddcup.data.gz file and extract it. 
There shouldn't be any issues with version mismatch and code functionality, but just 


in case, the exact Python 3 packages used in this example are as follows: 
e numpy 1.15.3 
e pandas 0.23.4 
e  scikit-learn 0.19.1 
e matplotlib 2.2.2 


First, import all the necessary modules that your code calls upon (Figure 2-10). 


37 


CHAPTER 2. TRADITIONAL METHODS OF ANOMALY DETECTION 


import numpy as np 

import pandas as pd 

import matplotlib.pyplot as plt 

from sklearn.ensemble import IsolationForest 

from sklearn.model selection import train test split 


from sklearn.preprocessing import LabelEncoder 


smatplotlib inline 





Figure 2-10. Importing numpy, pandas, matplotlib.pyplot, and sklearn modules 


The module numpy is a dependency of many of the other modules since it allows 
them to perform high levels of computation. Pandas is a module that allows us to 
read data files of various formats in order to store them as data frame objects, and it is 
a popular framework for data science in general. These data frames hold data entries 
in a similar fashion to arrays and can be thought of as a table of values. Matplotlib is 
a Python library that allows us to customize and plot data. Finally, scikit-learn is a 
package that allows us to apply various machine learning models to data sets as well as 
provide tools for data analysis. 

yMatplotlib inline allows for graphs to be displayed below the cell and to be saved 
alongside the notebook. 

Next, define the columns and load the data frame (Figure 2-11). 


38 


CHAPTER 2. TRADITIONAL METHODS OF ANOMALY DETECTION 


columns = [“duration", “protocol type", “service”, “flag”, “sre bytes", 
"dst bytes", "land", "wrong fragment", "urgent", 


"hot", “num failed logins", “logged in", “num compromised", 
"SOOt shell™, “sw attempted", “num root", 


"num file creations", "num shells", “num _access files", 
“num outbound cmds", “is host login", 


"is guest login", “coun”, “Srv Counc", “Serror rate”, 
"SEY SCYrrOr rate”; "EGrror vate", “SrV Ferror rate”, 


"Same Srv race”, "Gilt srv rate", “srv ditt Host. rate”, 
"det host count", “dst host. sry count", 


"dst. NOSst same Srv rate", “dst Nost. ditt sry rate", 
"dst. NOSt same src port rate", “dst Nost srv ditt host. rate”, 


"St. NOSt Serror rate”, “dst host sry Serror rate”, 
"ast Kost terror fate”, “dst host sry rerror rate", “Label” ] 


df = pd.read_ csv ("datasets/kdd cup 1999/kddcup.data/kddcup.data.corrected", 
sep=",", names=columns, index col=None) 





Figure 2-11. You define all of the columns and save the data set as a variable 
named df 


Each data entry is massive, with 42 columns of data per entry. The exact name 
doesn’t matter, but it’s important to have “service” and “label” stay the same. The entire 


list of columns names is as follows: 
e duration 


e protocol_type 


e service 

e flag 

e src_bytes 
e dst_bytes 
e land 


e wrong fragment 
e urgent 


e hot 


39 


CHAPTER 2. TRADITIONAL METHODS OF ANOMALY DETECTION 


40 


num_failed_logins 
logged_in 
num_compromised 
root_shell 
su_attempted 
num_root 
num_file_creations 
num_ shells 
num_access_files 
num_outbound_cmds 
is_host_login 
is_guest_login 

count 

srv_count 

serror_rate 
srv_serror_rate 
rerror_rate 
srv_rerror_rate 
same_srv_rate 
diff_srv_rate 
srv_diff_host_rate 
dst_host_count 
dst_host_srv_count 
dst_host_same_srv_rate 
dst_host_diff_srv_rate 


dst_host_same_src_port_rate 


CHAPTER 2. TRADITIONAL METHODS OF ANOMALY DETECTION 


e dst_host_srv_diff_host_rate 
e dst_host_serror_rate 
e dst_host_srv_serror_rate 
e dst_host_rerror_rate 
e dst_host_srv_rerror_rate 
e label 
To get the dimensions of the table, or shape, as it’s referred to in pandas, do 
df.shape 
or if you're not in Jupyter, do 
print (df.shape) 
In Jupyter, you should see something like Figure 2-12 after running the code. 
In [87]: df.shape 


Out(87]: (4898431, 42) 


Figure 2-12. The output is a tuple that describes the dimensions of the data frame 


As you Can see, this is a massive dataset. 
Next, filter out the entire data frame to only include data entries that involve an 
HTTP attack, and drop the service column (Figure 2-13). 


df df[dt["service"] == "http"] 


df = df.drop("service", axis=1) 


columns.remove ("Service") 





Figure 2-13. Filtering df to only have HTTP attacks and removing the service 
column from df 


Just to make sure, check the shape of df again (Figure 2-14). 


4] 


CHAPTER 2. TRADITIONAL METHODS OF ANOMALY DETECTION 


In [91]: df.shape 
Out(91]: (623091, 41) 
Figure 2-14, The dimensionality of the filtered df 
The number of rows has been drastically reduced, and the column count went 
down by one because you removed the service column since you don’t actually need it 
anymore. 
Let’s check all the possible labels and the number of counts for each label, just to get 


a feel of the data distribution. 
Run the following: 


df[ "label" ].value_counts() 
or 
print(df[ "label" ].value_ counts()) 


You should see something like Figure 2-15. 


In [93]: L df["label"].value_counts() 
Out[(93]: normal. 619046 

back. 2203 

neptune. 1801 

portsweep. 16 

ipsweep. 13 

satan. 7 

phf. 5 

nmap. 1 


Name: label, dtype: into4 


Figure 2-15. The unique labels in df along with the number of instances of data 
points in df with that specific label 


The vast majority of the data set is comprised of normal data entries, with around 
0.649% of data entries for all HTTP attacks comprising actual intrusion attacks. 

Additionally, some of the columns have categorical data values, meaning the model 
will have trouble training on them. To bypass this issue, you use a built-in feature of 
scikit-learn called a label encoder. 

Figure 2-16 shows what you currently see if you run df.head(5), meaning you want 
five entries to display. 


42 


CHAPTER 2. TRADITIONAL METHODS OF ANOMALY DETECTION 
In [97]: | df.head(5) 


duration protocol_type flag src_bytes dst_bytes 
0 0 tcp SF 215 45076 


1 0 tcp SF 162 4528 
2 0 tcp SF 236 1228 
3 0 tcp SF 233 2032 
4 0 tcp SF 239 486 


Figure 2-16. A line of code to display the top five entries in the table. In this case, 
the image has been cropped to show the first few columns 


You can also run print(df.head(5)), but it prints in a text format (Figure 2-17). 


In [98]: print (df.head(5) ) 

duration protocol type flag src bytes dst_bytes land wrong fragment \ 

0 0 tcp SF 215 45076 0 0 

1 0 tcp SF 162 4528 0 0 

2 0 tcp SF 236 1228 0 0 

3 0 tcp SF 233 2032 0 0 

4 0 tcp SF 239 486 0 0 
urgent hot num failed logins logged in num compromised root shell \ 

0 0 0 0 1 0 0 

i 0 0 0 1 0 0 

2 0 0 0 1 0 0 

3 0 0 0 1 0 0 

4 0 0 0 1 0 0 


Figure 2-17. The same function as in Figure 2-16, but in text format 


To resolve this issue, the label encoder takes the unique (meaning one entry per 
categorical value instead of multiple) list of categorical values and assigns a number 
representing each of them. If you had an array like 


[ "John", "Bob", "Robert" ], 
the label encoder would create a numerical representation like 
[O, 1, 2], where O represents "John", 1 represents "Bob", and 2 represents 


"Robert." 


43 


CHAPTER 2. TRADITIONAL METHODS OF ANOMALY DETECTION 


Now do the same with the labels in your data frame. 


Run the code in Figure 2-18. 


for col in df.columns: 


1f df[col].dtype == "object": 


encoded = LabelEncoder () 
encoded. fit(df[col]) 


dfl[col] = encoded.transform(dft[col]) 





Figure 2-18. Applying the label encoder to the columns with data values that are 
strings 


encoded. fit(df[col]) gives the label encoder all of the data in the column from 
which it extracts the unique categorical values from. When you run 


df[col] = encoded.transform(df[col]) 


you are assigning the encoded representation of each categorical value to df[col]. 
Let’s check the data frame now (Figure 2-19). 


In [101]: . dt.head(5) 
Out[(101]: 
duration protocol_type flag src_bytes dst_bytes land 
0 0 0 9 215 45076 0 
1 0 o 9 162 4528 0 
2 0 0 9 236 1228 0 
3 0 0 9 233 2032 0 
4 0 o 9 239 486 0 


Figure 2-19. Looking at the first five entries of df after applying the label encoder 


Good, all the categorical values have been replaced with numerical equivalents. 
Now run the code in Figure 2-20. 


ae 


CHAPTER 2. TRADITIONAL METHODS OF ANOMALY DETECTION 


for £ in range(0O, 3): 


df = df.iloc[np.random.permutation (len (df) ) ] 


diZ = di[2500000) 
labels = df2["label"] 


dt validate = df [S00000: | 


x Crain, & test, y train, y Test — train test. splacidiz, labels, 
Lest_size = 0.2, random state =— 42) 


x Val, y val = di validate, df validate ("label”] 





Figure 2-20. Shuffling the values in df and creating your training, testing, and 
validation data sets 


With 
df = df.iloc[np.random.permutation(len(df) ) ] 


you are randomly shuffling all the entries in the data set to avoid the problem of 
abnormal entries pooling in any one region of the data set. 
With 


df2 = df[:500000] 


you are assigning the first 500,000 entries of df to a variable df2. 

In the next line of code, labels = df2[ "label" |, you assign the label column to 
the variable labels. Next, you assign the rest of the data frame to a variable named 
df_validate to create the validation data set with df_ validate = df[500000: |. 

To split your data into the training set and testing set, you can use a built-in 
scikit-learn function called train test split, as detailed below: 


x train, x test, y train, y_ test = train test split(df2, labels, 
test_size = 0.2, random state = 42) 


The parameters are as follows: x, y, test_size, and random_state. Note that x and 
y are supposed to be the training data and training labels, respectively, with test_size 
indicating the percentage of the data set to be used as test data. random_state is a 


45 


CHAPTER 2. TRADITIONAL METHODS OF ANOMALY DETECTION 


number used to initialize the random number generator that determines what data 


entries are chosen for the training data set and for the test data set. 


Finally, you delegate the rest of the data to the validation set. To define the terms 


again: 


Training data is the data that the model trains and learns on. For an 
isolation forest, this set is what the model partitions on. For neural 
networks, this set is what the model adjusts its weights on. 


Testing data is the data that is used to test the model’s performance. 
The train test_split() function basically splits the data into 
a portion used to train on and a portion used to test the model’s 


performance on. 


Validation data is used during training to gauge how the model’s 
training is going. It basically helps ensure that as the model gets 
better at performing the task on the training data, it also gets better 
at performing the same task over new, but similar data. This way, 

the model doesn’t only get really good at performing the task on the 
training data, but can perform similarly on new data as well. In other 
words, you want to avoid overfitting, a situation where the model 
performs very well on a particular data set, which can be the training 
data set, yet the performance noticeably drops when new data is 
presented. A slight drop in performance is to be expected when the 
model is exposed to new variations in the data, but in this case, it is 


more pronounced. 


In this example, you don’t use the validation set or testing set during training, but 


this will come into play later on when you are training neural networks. Instead, you use 


them to evaluate the performance of the model. 


Let’s take a look at the shapes of your new variables by running the code in 
Figure 2-21. 


46 


CHAPTER 2. TRADITIONAL METHODS OF ANOMALY DETECTION 


In [140]: print ("Shapes:\nx train:%s\ny train:%s\n" % (x train.shape, y train.shape) ) 
print ("x test:%s\ny test:%s\n" % (x test.shape, y test.shape) ) 
print ("x val:%s\ny val:%s\n" % (x _val.shape, y val.shape) ) 


Shapes: 
x train: (400000, 41) 
y_train: (400000, ) 


x_ test: (100000, 41) 
y_test: (100000, ) 


x_val: (123091, 41) 
y_val: (123091, ) 


Figure 2-21. Getting the shapes of the training, testing, and validation data sets 


To build your isolation forest model, run the following: 


isolation forest = IsolationForest(n estimators=100, max _samples=256, 
contamination=0.1, random state=42) 


Here’s an explanation of the parameters: 


e n_ estimators is the number of trees to use in the forest. The default 
is 100. 


e max_samples is the maximum number of data points that the 
tree should build on. The default is whatever is smaller: 256 or the 
number of samples in the data set. 


e contamination is an estimate of the percentage of the entire data set 
that should be considered an anomaly/outlier. It is 0.1 by default. 


e random_state is the number it will initialize the random number 
generator with to use during the training process. An isolation forest 
utilizes the random number generator quite extensively during the 


training process. 
Now, let’s train your isolation forest model by running 
isolation forest. fit(x train) 


This process will take some time, so get up and stretch for a bit! 

Once it’s finished, you can go about calculating the anomaly scores. Let’s create a 
histogram of the anomaly scores when tested on the validation set. 

Run the code in Figure 2-22. 


47 


CHAPTER 2. TRADITIONAL METHODS OF ANOMALY DETECTION 


anomaly scores = 1so0lation forest.decision function (x val) 


plt.figure (figsize=(15, 10)) 

plt.hist (anomaly scores, bins=100) 
plt.xlabel('Average Path Lengths', fontsize=14) 
plt.ylabel('Number of Data Points', fontsize=14) 


plt.show () 





Figure 2-22. Getting the anomaly scores from the trained isolation forest model 
and plotting a histogram 


You should see a graph that looks like Figure 2-23. 


20000 
17500 


15000 


=) 
in 
& 
& 


10000 


Number of Data Points 


wa 
S 


2500 





-0.2 -0.1 0.0 
Average Path Lengths 


Figure 2-23. A histogram plotting the average path lengths for the data points. 
It helps you determine what is an anomaly by using the shortest set of path lengths, 
since that indicates that the model was able to easily isolate those points 


48 


CHAPTER 2. TRADITIONAL METHODS OF ANOMALY DETECTION 


A quick note: p1t. show() is not necessary on Jupyter if you have %matplotlib inline, 
but if you are using anything else, this should open up a new window with the graph. 

Let’s calculate the AUC to see how well the model did. Looking at the graph, there 
appears to be a few anomalous data with average path of less than -0.15. You expect 
there to be a few outliers within the normal range of data, so let’s pick something more 
extreme, such as -0.19. Remember that the lesser the path length, the more likely the 
data is to be anomalous, hence why there’s a curve that increases drastically as the graph 
goes right. Run the code in Figure 2-24. 


from sklearn.metrics import roc auc score 


anomalies — anomaly scores. > -0..19 
Matches. = y val == lasrc (encoded, classes.) <index (“normal.™) 
auc = roc auc score (anomalies; matches) 


Deine (AUC? {s.25}"2tOrmal (auc) ) 





Figure 2-24. Classifying anomalies based on a threshold that you picked from a 
graph and generating the AUC score from that set of labels for each point 


You should see something like Figure 2-25. 


In [167]: from sklearn.metrics import roc auc score 


anomalies = anomaly scores > -0.19 
matches = y val = list(encoded.classes ).index("normal.") 


auc = roc auc score(anomalies, matches) 
print ("AUC: {:.2%}".format (auc) ) 


AUC: 99.81% 
Figure 2-25. The generated AUC score after running the code 
That’s an impressive score! But could it be the result of overfitting? Let’s get the 


anomaly scores of the test set to find out. 
Run the code in Figure 2-26. 


49 


CHAPTER 2. TRADITIONAL METHODS OF ANOMALY DETECTION 


anomaly ScOres Test = 1s0laLtion FOrest«decision Funclion(x test) 


plt.figure (figsize=(15, 10)) 


PLE. hist (anomaly scores. test, bains=100)} 


plt.xlabel('Average Path Lengths', fontsize=14) 
plt.ylabel('Number of Data Points', fontsize=14) 


plt.show () 





Figure 2-26. Creating a histogram like in Figure 2-23 for the testing set instead of 
the validation set 


You should get a graph like Figure 2-27. 
16000 
14000 
12000 
10000 


8000 


Number of Data Points 


6000 


4000 





-0.2 0.1 0.0 
Average Path Lengths 


Figure 2-27. A histogram like in Figure 2-23, but for the testing set 
There is a similar pattern of what appear to be anomalous data to the left of -0.15. 
Again, assume that there are expected outliers, and pick any average path length less 


than -0.19 as the cutoff for anomalies. 
Run the code in Figure 2-28. 


50 


CHAPTER 2. TRADITIONAL METHODS OF ANOMALY DETECTION 


anciialiles Lest = enomaly Scores test. > =-U.19 


Matches = y. test == laser (encoded classes )4index("notmal.”) 


auc = Toc auc Score(anomalies test, matches) 


Princ ("AUCs tis.Z2e}"stormeat. (auc) ) 





Figure 2-28. Applying the code in Figure 2-24 to the test set. In this case, the 
threshold was the same, but you still picked it based on the histogram 


It should look like Figure 2-29. 


In [163]: anomalies test = anomaly scores test > -0.19 
matches = y test == list(encoded.classes ).index("normal.") 
auc = roc auc score(anomalies test, matches) 
print ("AUC: {:.2%}".format (auc) ) 


AUC: 99.82% 


Figure 2-29. The generated AUC score for the test set 


That’s really good! It seems to perform very well on both the validation data and the 
test data. 

Hopefully by now you will have gained a better understanding of what an isolation 
forest is and how to apply it. Remember, an isolation forest works well for multi- 
dimensional data (in this case, you had 41 columns after dropping the service column) 
and can be used for unsupervised anomaly detection when applied in the manner 


implemented in this section. 


One-Class Support Vector Machine 


The One-Class SVM is a modified support vector machine model that is well-suited for 
novelty detection (an example of semi-supervised anomaly detection). The idea is 
that the model trains on normal data and is used to detect anomalies when new data is 
presented to it. While the OC-SVM might seem best suited to semi-supervised anomaly 
detection, since training on only one class means it’s still “partially labeled” when 
considering the entire data set, it can also be used for unsupervised anomaly detection. 
You will perform semi-supervised anomaly detection on the same KDDCUP 1999 data 


ol 


CHAPTER 2. TRADITIONAL METHODS OF ANOMALY DETECTION 


set as the isolation forest example. Similar to the isolation forest, the OC-SVM is also 
good for high-dimensional data. Additionally, the OC-SVM can capture the shape of the 
data set pretty well, a point that will be elaborated upon below. 

To understand how a support vector machine works, first visualize some data on a 
2D plane (Figure 2-30). 





Figure 2-30. Some points plotted so that they group up in two regions on 
the graph 


How do you separate the data into two distinct regions using a line? Well, it’s pretty 


simple (Figure 2-31). 


92 


CHAPTER 2. TRADITIONAL METHODS OF ANOMALY DETECTION 





Figure 2-31. A line that separates the two regions based on the points plotted 


Now you have two regions representing two different labels. However, the problem 
goes a little bit deeper than that. 

The reason the model is called a “support vector machine” is because these “support 
vectors” actually play a huge role in how the model draws the decision boundary, 
represented in this case by the line in Figure 2-32. 


93 


CHAPTER 2. TRADITIONAL METHODS OF ANOMALY DETECTION 





Figure 2-32. The decision boundary drawn with support vectors 


Basically, a support vector is a vector parallel to the hyperplane that acts as the 
decision boundary, containing a point that is closest to the hyperplane, and helps 
establish a margin for the decision boundary. In this example, the hyperplane is a line 
because there are only two dimensions. In three dimensions, the hyperplane would be a 
plane, and in four dimensions, it would be a three-dimensional space, and so on. 

The most optimal hyperplane would involve the support vectors establishing a 
maximum margin for the hyperplane. The example in Figure 2-32 is not optimal, so let’s 
look for a more optimal hyperplane in Figure 2-33. 


o4 


CHAPTER 2. TRADITIONAL METHODS OF ANOMALY DETECTION 


1 
1 
1 
\ 
I 
1 
1 
1 
J 
1 
1 
! 
1 
i] 
1 
1 
1 
\ 
1 
1 
| 





Figure 2-33. A hyperplane with support vectors that allow for a larger margin 


With how the hyperplane is drawn, the points which their respective support vectors 
pass through are the closest to the hyperplane. This is a more optimal solution for a 


hyperplane since the margin for the hyperplane is much larger than in the previous 
example (Figure 2-32). 


However, realistically, you will see hyperplanes that are more like Figure 2-34. 


99 


CHAPTER 2. TRADITIONAL METHODS OF ANOMALY DETECTION 








Figure 2-34. A more realistic example of how a hyperplane functions 


There will always be outliers that prevent a clear distinction between two 
classifications. If you think back to the invasive fish example, there were some native fish 
that looked like invasive fish, and some invasive fish that looked like native fish. 

Alternatively, Figure 2-35 shows a possible solution. 


96 


CHAPTER 2. TRADITIONAL METHODS OF ANOMALY DETECTION 





Figure 2-35. An example of a hyperplane completely separating the two regions. 
However, this is an example of overfitting 


While this does count as a solution to the classification problem, this would lead to 
overfitting, resulting in another issue. If the SVM performs too well on the training data, 
it could perform worse on new data that contains different variations. 

The decision boundaries won't be that simple either. You could run into situations 
such as the one shown in Figure 2-36. 


ov 


CHAPTER 2. TRADITIONAL METHODS OF ANOMALY DETECTION 





Figure 2-36. A graph showcasing a different type of grouping of the data points 


You can’t draw a line for this, so you have to think differently instead of using a linear 
SVM. Let’s try to map the distances of each point from the center of the dark dots onto 
the 3D plane through some function (see Figure 2-37). 


98 


CHAPTER 2. TRADITIONAL METHODS OF ANOMALY DETECTION 





Figure 2-37. Plotting the points onto the 3D plane shows that you can now 
separate the regions 


Now there is a clear separation between the two classes, and you can go ahead with 
separating the data points into two regions, as in Figure 2-38. 


99 


CHAPTER 2. TRADITIONAL METHODS OF ANOMALY DETECTION 





Figure 2-38. The hyperplane now is an actual plane because of the added third 
dimension 


When you go back to the 2D representation of the points, you can see something like 
Figure 2-39. 


60 


CHAPTER 2. TRADITIONAL METHODS OF ANOMALY DETECTION 


O O O @Q O 





Figure 2-39. This is what the hyperplane looks like when you go back to 2D 


What you just did was use a kernel to transform the data into another dimension 
where there is a clear distinction between the classes of data. This mapping of data 
is called a kernel trick. There are different types of kernels, including the linear 
kernel you saw in the earlier examples. Other types of kernels include polynomial 
kernels, which map the data to some nth dimension using a polynomial function, and 
exponential kernels, which map the data according to an exponential function. 

Another term to cover is regularization, a parameter that tells the SVM how much 
you want to avoid misclassifications. Lower regularization values lead to graphs like 
the one you saw earlier where there were a few outliers on either side of the hyperplane. 
Higher regularization values lead to graphs where you saw the hyperplane separate 
every single point, at the cost of possibly overfitting on the data. 

Gamma tells the SVM how much to consider points farther away from the region 
of separation between the classes. Higher gamma values tell the SVM to only consider 
nearby points, while lower gamma values tell the SVM to also consider the points 
farther away. 


61 


CHAPTER 2. TRADITIONAL METHODS OF ANOMALY DETECTION 


Finally, the margin is the separation between each class and the hyperplane. As 
discussed earlier, an ideal margin involves the maximum equidistant separation of 
each of the closest from the hyperplane. A bad margin or suboptimal margin has 
the hyperplane too close to one class or the distance not be as far as it can be to the 
hyperplane for each point or support vector. 

As for the one-class support vector machine, Figure 2-40 shows what the graph 
would look like. 


@ New “normal” observation 
‘e) Training observation 


@ New “anomalous” observation 





Figure 2-40. An example of the decision boundary for a one-class support vector 
machine 


During training, the OC-SVM learns the decision boundary for normal observations, 
accounting for a few outliers. If novelties, new data points that the model has never seen 
before, fall within this decision boundary, they are considered normal by the model. If 
they fall outside of the boundary, they are considered anomalous. This technique is an 
example of semi-supervised novelty detection, where the goal is to train the model on 
normal data, and then it attempts to find anomalies in new data. 

By doing so, the OC-SVM can capture the shape of the data pretty well thanks to the 
decision boundary that captures most of the training observations. 


62 


CHAPTER 2. TRADITIONAL METHODS OF ANOMALY DETECTION 


Anomaly Detection with OC-SVM 


Now that you know more about how SVMs work, let’s get started by applying a one-class 
SVM to the KDDCUP 1999 data set. 
Import your modules and load up the data set (see Figure 2-41 and Figure 2-42). 


import numpy as np 

import pandas as pd 

import matplotlib.pyplot as plt 

from sklearn.model selection import train test split 
from sklearn.preprocessing import LabelEncoder 


from sklearn.svm import OneClassSVM 


smatplotlib inline 





Figure 2-41. Importing your modules for the OC-SVM 


columns = [“duration”, “protocol type", “service”, “Lilag”, “sre bytes”, 
"dst bytes", "land", "wrong fragment", "urgent", 


"hot", "num failed logins", "logged in", "num compromised", 
"POOL Shell", “su. attempted", "nium roor”,; 


"num tile creations”, "num shelis”,. “num access files”, 
"num outbound cmds", "is host login", 


Le Cues. Login, “Count, "Srv COUNT”, “SSrror tate, 
"SEY SCYrror Late", YKGEYOr rate", “SFY Fervor Fate”, 


“Same: Srv vate", “Girt rv rate”, "sry dirt Nnost. rate”, 
"dst host count”, "det Nost. sry count", 


"OSt NOSt: Game Srv race”, “dst Nost. citi sry vate”, 
"dst. NhOSt same Srce port.rate”, “dst Nost. sry diff host rate”, 


"Ost DOSt. SCrror tate", "CSt NOsSt sky Serror rate”, 
"Ge HOSt terror rate", “GSt. host Sry r,error rate", “Label” | 


df = pd.read_ csv("datasets/kdd cup _ 1999/kddcup.data/kddcup.data.corrected", 
sep=",", names=columns, index col=None) 





Figure 2-42. Defining the columns for the data set, and importing the data set 
into the data frame variable df 


63 


CHAPTER 2. TRADITIONAL METHODS OF ANOMALY DETECTION 


Now, let’s move on to filtering out all the normal data entries. You will make two 
data frames that consist of normal entries and an equal mix of anomalies and normal 
data entries. 

Run the code in Figure 2-43. 


di dfi[dft["service"] == "http"] 
df = df.drop("service", axis=1) 


columns.remove ("Service") 


novelties = df[df["label"] != "normal."] 


novelties normal = df[150000:154045] 


novelties = pd.concat(|novelties, novelties normal))) 


normal = df[dft["label"] == "normal."] 





Figure 2-43. Filtering out the anomalies and the normal data points to construct 
a new data set that is a mixture of the two 


Figure 2-44 shows the shapes of the two data frames. 


In [308]: print (novelties. shape) 
print (normal. shape) 


(8090, 41) 
(619046, 41) 


Figure 2-44, Printing out the shapes of the novelty and normal data sets 
The first half of the data frame “novelties” consists of anomalies, while the latter half 
consists of normal data entries. 


Now you move on to encoding all the categorical values in the data frames (see 
Figure 2-45). 


64 


CHAPTER 2. TRADITIONAL METHODS OF ANOMALY DETECTION 


col in normal.columns: 


1£f normal[col].dtype == "object": 


encoded = LabelEncoder () 
encoded. fit (normal[col]) 


normal[col] = encoded.transform(normal [col] ) 


col in novelties.columns: 

1£f novelties[col].dtype == "object": 
encoded2 = LabelEncoder () 
encoded2.fit(novelties[col]) 


novelties[col] = encoded2.transform(novelties[col] ) 





Figure 2-45. Applying the label encoder to the data sets 
Now run the code in Figure 2-46 to set up your training, testing, and validation sets. 
for f in range(QO, 10): 


normal = normal.iloc[np.random.permutation(len(normal) ) ] 


df2 = pd.concat ([normal[:100000],normal [200000:250000] ] ) 


df validate = normal [100000:150000] 


xX Urainy, xX USst—= Crain. test splactdiz; Cest. size = 0.2, 
random state = 42) 


x val = df validate 





Figure 2-46. Shuffling the entries in the normal data set, and defining the 
training, testing, and validation sets 


Figure 2-47 shows the shapes of the data sets. 


65 


CHAPTER 2. TRADITIONAL METHODS OF ANOMALY DETECTION 


In [12]: 1 print ("Shapes:\nx train:{}\n" .format(x_train.shape) ) 
print ("x test:{}\n".format (x test.shape) ) 
print ("x val:{}\n".format (x val.shape) ) 


Shapes: 

x train: (120000, 41) 
x test: (30000, 41) 

x val: (50000, 41) 


Figure 2-47. Printing the output shapes of the training, testing, and validation sets 


You are only using a subset of the entire data set to train the model on because the 
larger the training data, the longer it takes for the OC-SVM to train. 
Run the code in Figure 2-48 to declare and initialize the model. 


ocsvm = OneClassSVM(kernel='"rbf', gamma=0.00005, random state = 


42, nu=0.1) 





Figure 2-48. Defining your OC-SVM model 


By default, the kernel is set to ‘rbf, meaning radial basis function. It is similar to the 
circular decision boundary that you saw in the earlier examples, and you use it here 
because you want to define a circular boundary around a set of regions that contain 
normal data. As seen in the earlier examples, any points that fall outside of the region 
are to be considered anomalies. Gamma tells the model how much you want to consider 
points further from the hyperplane. Since it is pretty small, this means you want to 
emphasize the points farther away. The random_state is just a seed for initializing the 
random number generator, similar to the isolation forest model. The next parameter, nu, 
specifies how much of the training set contains outliers. Again, you set this to 0.1, similar 
to the isolation forest model. This acts similar to the regularization parameter that you 
saw earlier, since it tells the model approximately how many data points you expect the 
model to misclassify. 

Now let’s train the model and evaluate predictions (see Figure 2-49). 


66 


CHAPTER 2. TRADITIONAL METHODS OF ANOMALY DETECTION 


ocsvm.fit(x train) 


Figure 2-49. Training the OC-SVM model on the training data 


One thing to note is that you can’t get the values for an AUC curve for x_test and 
x_validation since they comprise entirely of normal data values. You can’t get values for 
true negative or for false positive since there are no anomalies in the data set to classify 
falsely as normal or correctly as anomalies. 

However, you can still measure the accuracy of the model on the test and validation 
sets. Even though accuracy is not the best metric to go by, it can still give you a good 
indicator of the model’s performance. 

Also one thing to note: Accuracy in this case is a measure of the percentage of 
data points in the predictions that are normal data points. Remember, you assumed 
that around 10% of the data points in the data set are anomalies, so the most optimal 
“accuracy” to obtain is 90%. 

Run the code in Figure 2-50. 


preds OGSVlsPrSo1cL (x Cesc) 

score = 0 

for f in range(0, x test.shape[0]): 
if(preds[f] == 1): 


score = score = 1 


accuracy = score / x test.shape[0] 


PEInNt("AcCcuracy> {fazco}".TOrmal (accuracy )) 





Figure 2-50. Making predictions and generating the “accuracy” score 


67 


CHAPTER 2. TRADITIONAL METHODS OF ANOMALY DETECTION 


In [33]: 
preds = ocsvm.predict(x test) 
score = 0 
for £ in range(0, x test.shape([0]): 
if (preds[f] == 1): 


score = score + i 


accuracy = score / x test.shape[0] 
print("Accuracy: {:.2%}".format (accuracy) ) 


Accuracy: 89.09% 
Figure 2-51. The resulting output accuracy for the testing data set 
Figure 2-51 shows that the accuracy is about 89.1%, which is pretty good considering 


that you assumed 10% of the data would misclassify. 
Let’s run the code on x_validation this time (see Figure 2-52). 


preds OcSsvil.prediculx. val) 

score 0 

for f in range(0, x val.shape[0]): 
if(preds[f] == 1): 


score = score + 1 


accuracy = score / x Val.shape [0] 


DELaL (Accuracy: {25.26} "<tormat (accuracy) 





Figure 2-52. Generating the accuracy score for the validation set 


68 


CHAPTER 2. TRADITIONAL METHODS OF ANOMALY DETECTION 


In [34]: 1 preds = ocsvm.predict(x val) 


a) 
score 0 
for f in range(0, x val.shape[0]): 


if (preds[(f}) = 1): 


score = score + i 


accuracy = score / x val.shape[0] 
print ("Accuracy: {:.2%}".format (accuracy) ) 


Accuracy: 69.49% 


Figure 2-53. The resulting percentage of data points in the predictions that were 
considered normal 


This time the accuracy was even better at around 89.5% (Figure 2-53). 

Now to test on the novelties data set. This time, you can find the AUC score because 
there is a 50-50 split between anomalies and normal data. The other two data sets, x_test 
and x_validation, only had normal data, but this time it is possible for the model to 
classify false positives and true negatives. 

Run the code in Figure 2-54. 


from sklearn.metrics import roc auc score 


preds = ocsvm.predict (novelties) 


matches = novelties["label"] == 4 


auc = £OC auc score (preds;, Matches) 


prane("AUC? {e226} ".10rmeat (auc),) 





Figure 2-54. The code to generate the AUC score 


69 


CHAPTER 2. TRADITIONAL METHODS OF ANOMALY DETECTION 


In [4/]: | from sklearn.metrics import roc auc score 


3 preds = ocsvm.predict (novelties) 
4 matches = novelties["label"] == 


auc = roc auc score(preds, matches) 
print("AUC: {:.2%}".format (auc) ) 


AUC: 95.83% 
Figure 2-55. The generated AUC score from the predictions on the novelty set 


Figure 2-55 shows the score. That’s pretty good for an AUC score! 
Let’s look at the distribution of predictions in Figure 2-56. 





plt. figure (figsize=(10,5)) 


plt.hist(preds, bins=[-1.5, -0.5] + [0.5, 1.5], align='mid') 


plt.xticks([-1, 1]) 


plt. show () 





Figure 2-56. Code to display a graph that shows the distributions for the 
predictions 


70 


CHAPTER 2. TRADITIONAL METHODS OF ANOMALY DETECTION 


In [48]: 1 plt.figure (figsize=(10,5)) 
2 plt.hist(preds, bins=[(-1.5, -0.5] + [0.5, 1.5], align='mid') 
3 plt.xticks([-1, 1]) 
plt.show() 


4000 





3000 


2000 


1000 





=] 1 


Figure 2-57. The resulting output. 1 stands for normal data points, and -1 stands 
for anomaly data points 


As you Can see in Figure 2-57, the model ended up predicting more anomalies than 
normal data points, but from what the AUC tells us, it managed to classify most of the 
data entries correctly. 

Hopefully by now you will have gained a better understanding of what an 
OC-SVM is and how to apply it. Remember, OC-SVM works well for multi-dimensional 
data (in this case, you had 41 columns after dropping the service column) and can 
be used for semi-supervised anomaly detection when applied in the manner 
implemented in this section. 


Summary 


In this chapter, we discussed traditional methods of anomaly detection and how they 
can be used to implement anomaly detection in an unsupervised and semi-supervised 
manner. 


In the next chapter, we will look at the advent of deep learning networks. 


71 


CHAPTER 3 


Introduction to Deep 
Learning 


In this chapter, you will learn about deep learning networks. You will also learn how 
deep neural networks work and how you can implement a deep learning neural 
networks using Keras and PyTorch. 

In a nutshell, the following topics will be covered throughout this chapter: 


e What is deep learning? 
e Intro to Keras: A simple classifier model 


e Intro to PyTorch: A simple classifier model 


What Is Deep Learning? 


Deep learning is a special subfield of machine learning that deals with different types of 
artificial neural networks. Drawing inspiration from the structure and functionality of 

a brain, artificial neural networks at their core are layers of interlinked, individual units 
call neurons that each perform a specific function given input data. 

In “deep” learning specifically, some of the best models consist of dozens of layers 
and millions of neurons, and have been trained on multiple gigabytes of data. Generally, 
deep learning models don’t always need to be this big to perform well on certain tasks, 
and the tasks that the large models are expected to perform are complex, ranging from 
outlining a wide variety of objects within an image to generating summaries of articles. 

Thanks to recent increases in the computational power and availability of GPUs 
(graphics processing units), anyone with access to a decent enough GPU can train their 
own deep learning models, keeping in mind that larger models might require more GPU 


resources such as Memory. 


73 
© Sridhar Alla, Suman Kalyan Adari 2019 


S. Alla and S. K. Adari, Beginning Anomaly Detection Using Python-Based Deep Learning, 
https://doi.org/10.1007/978-1-4842-5177-5_3 


CHAPTER 3. INTRODUCTION TO DEEP LEARNING 


Today, deep learning is taking the world by storm thanks to the extreme versatility 
and performance that it offers. More traditional models in machine learning have a 
problem where adding more training samples leads to a plateau in performance, but 
that problem doesn’t exist with deep learning. Instead, deep learning models get better 
and better with more samples, meaning they scale far better in terms of data set size 
and gain better performance as a result. Deep learning models can be applied to nearly 
any task with resounding success, and so are employed in the fields of cybersecurity, 
meteorology, finances and stock markets, speech recognition, medicine, search engines, 
etc. What exactly about deep learning makes it so great? First, let’s take a look at what an 
artificial neural network is. 


Artificial Neural Networks 


Artificial neural networks are layers of interconnected nodes, or artificial neurons, that 
function in a way inspired by biological neural networks. Figure 3-1 shows an example of 


da neuron. 


Terminal Axon 


Dendrites 





Figure 3-1. An example of what a neuron can look like 


74 


CHAPTER 3. INTRODUCTION TO DEEP LEARNING 


Inputs are taken in through the dendrites, after which the neuron decides whether 
or not to fire. Upon firing, the neuron sends a signal down the axon to its terminal axons, 


where the signals are output to any other neurons. This transfer of signals is called a 


synapse, which is modeled in Figure 3-2. 





Synapse 


Figure 3-2. How two neurons might connect to form a chain and transfer signals 
through that connection. The terminal axon of the first neuron connects to the 
dendrites of the second neuron 


We use a similar concept in artificial neural networks (Figure 3-3). 







Xy * 

4, 

: New Output 
X2 W> 
i 
Activation 

a Function 

x “ws 


Figure 3-3. How an artificial neuron in an artificial neural network can function. 
This mimicry of the biological neuron is the basis of artificial neural networks 


79 


CHAPTER 3 INTRODUCTION TO DEEP LEARNING 


In the case of this artificial neuron, we find the dot product between the input 
vector X and the weight vector W. X represents the input data, and W represents the 
list of weights that this node carries to multiply with the input vector. Recall that the 
dot product is when each element in the vector is multiplied with the corresponding 
element in the second vector, as in Figure 3-4. 


<a,b,d>-<e,f,g>=ae+bf+dg 


a e 


b| - |f | =ae+bf+cg 


Figure 3-4. This is how dot product works. Shown here is an example with two 
different types of vector notation 


Both are different ways to represent a vector, although the second method is ideal 
considering your data and weights would most likely take the shape of a matrix. 

After that, there is an optional bias function where the value b (called bias) is added 
to the dot product result. From there, it passes through an activation function that 
decides if the entire node sends data or not. In this case, the activation function only 
varies between 0 and 1 depending on whether or not the dot product plus the bias 
reaches a certain value or not (threshold). It is possible to have other activation functions 
such as a sigmoid function, which outputs some value between 0 and 1. 

Calling the output y and the input x, the basic function for each node can be 
represented by the equation in Figure 3-5. 


rN 
y=f » wixi +b 
i=1 


Figure 3-5. An equation that captures the basic functionality of an artificial 
neuron. In this case, f(x) is an activation function 


An artificial neural network is comprised of interlinked layers of these nodes and can 
look like Figure 3-6. 


76 


CHAPTER 3. INTRODUCTION TO DEEP LEARNING 


Input 
Data Input Layer Hidden Layer1 Hidden Layer 2 








Figure 3-6. An example of what an artificial neural network can look like 


A hidden layer is one that is between the input layer and the output layer. There 
can be multiple hidden layers in a network. Now that you've seen what an artificial 
neural network can look like, let’s take a look at how the data can flow through this 
network. First, we start with nothing but the input data in the network, and assume that 
neurons only wholly activate (neurons can partially activate depending on the activation 
function, but in this example each neuron either outputs a 1 or a 0) (Figure 3-7). 


U/ 


CHAPTER 3. INTRODUCTION TO DEEP LEARNING 


Input 
Data Input Layer Hidden Layer1 Hidden Layer 2 


Output Layer 


odeee 

OOO 0 CO 

OOO 0 CO 
OO O 


Figure 3-7. The input data runs through the input layer, and selective nodes fire 
based on the input received 


The input layer takes all of the corresponding inputs and produces an output that 
is linked to the first hidden layer. The outputs of the nodes that activate in the input 
layer are now the inputs of the hidden layer, and the new data flows correspondingly 
(Figure 3-8). 


78 


CHAPTER 3. INTRODUCTION TO DEEP LEARNING 


Input 
Data Input Layer Hidden Layer1 Hidden Layer 2 


\ Ps / 


f / 
f 


Output Layer 











OOOO CO 
OO O 


Figure 3-8. The outputs from the activated neurons in the input layer pass on 
to the first hidden layer. These outputs are now the inputs of the next layer, and 
selective neurons fire based on this input 


Hidden layer 1 processes the data in a similar fashion to the input layer, just with 
different parameters for activation function, weight, bias, etc. The data passes through 
this layer and the output of this layer becomes the input for the next hidden layer. In this 
case, only two nodes activate based on the input from the previous layer (Figure 3-9). 


fee, 


CHAPTER 3 INTRODUCTION TO DEEP LEARNING 


Input 
Data Input Layer Hidden Layer1 Hidden Layer 2 





Output Layer 


Figure 3-9. This process repeats with hidden layer 2 


Hidden layer 2 processes the data and sends the data to a new layer called the output 
layer, where only one of the nodes in the layer will be activated. In this case, the first 
node in the output layer is activated (Figure 3-10). 


80 


CHAPTER 3. INTRODUCTION TO DEEP LEARNING 


Input 
Data Input Layer Hidden Layer1 Hidden Layer 2 





Figure 3-10. Finally, the data from the second hidden layer goes to the output 
layer, where one neuron fires in this case 


The nodes in the output layer can represent the different labels that you want to give 
to the input data. For example, in the iris dataset, you can take various measurements of 
an iris flower and train an artificial neural network on this data to classify the species of 
the flower. 

Upon initialization, the weights of the model will be far from ideal. Throughout the 
training process, the data flow from the model goes forward (left to right from input to 
output), and then backwards in what is known as backpropagation to recalculate the 
weights and biases for each activated node. 

In backpropagation, a cost function takes into account the model’s predictions 
for one pass of the training data through the network and what the actual predictions 
should be. The cost function gives you an indicator of how good the model’s weights are 
at predicting the correct outcome. For this example, assume that Figure 3-11 shows the 
formula of the cost function. 


81 


CHAPTER 3. INTRODUCTION TO DEEP LEARNING 


1 . Ne 
(8) = =» (ho(x!) — y') 


Figure 3-11. The formula for the mean squared error cost function 


This cost function is called the mean squared error, named so because the function 
given input 0, the weights, finds the average difference squared between the predicted 
value and the actual value. The parameter h, represents the model with the weight 
parameter 0 passed in, so h,(x’) gives the predicted value for x‘ with model’s weights 0. The 
parameter y’ represents the actual prediction for the data point at index i. If the parameter 
you are passing in includes both weight and bias, then it will look more like Figure 3-12. 


1X , 
J(w,b) = =» (Iwo(x!) - y') 


Figure 3-12. A formula for the mean squared error cost function, with more 
specific notation separating the weights and the biases 


Note that h,, ,(x') will have the formula in Figure 3-13. 


h(w,b) = wx' +b 
Figure 3-13. An elaboration on what the function h(w, b) means in Figure 3-12 


The cost function reflects the overall performance of the model with the current 
weight parameter, so the most ideal value output from the cost function will be as small 
as possible. Since the cost function is a measure of how far the model’s predictions 
are from the actual value, you want to make the output from the cost function as small 
as possible since that means your predictions were almost what the actual prediction 
should be. 

To minimize the cost function, you need to tell the model how to adjust the weights, 
but how do you do that? If you think back to calculus, optimization problems involved 
finding the derivative and solving for the critical points (points where the derivative of 


the original equation is 0). In your case, you want to find the gradient, which can be 


82 


CHAPTER 3. INTRODUCTION TO DEEP LEARNING 


thought of as similar to the derivative but in a multi-dimensional setting, and adjust the 
weights in a direction that would change the gradient so it approaches 0. 

There are several optimization algorithms to help the model achieve the optimal 
weights including gradient descent. Gradient descent is an optimization algorithm that 
finds the gradient of the cost function and takes a single step in the direction of the local 
minimum to generate values to use to adjust the weights and biases. 

How much ofa step you take is controlled by the learning rate. The bigger the 
learning rate, the larger the step you take at each iteration, and the quicker the local 
minimum is approached. The smaller the learning rate, the longer the training takes 
since the steps are smaller. However, a problem with too large of a learning rate is that it 
could overshoot the local minimum entirely, leading to a complete failure to ever reach 
the local minimum. Too small of a learning rate and the local minimum might take way 
too long to reach. When the model starts to reach an ideal level of performance, the 
gradients should be approaching 0 since the weights would have the cost function reach 
a local minimum, signifying that the differences between the model’s predictions and 
the actual predictions are very small. 

In a process called backpropagation, the gradients are calculated and the weights 
are adjusted for each node in a layer, before the same process is done for the layer 
before that until all of the layers have had their weights adjusted. The entire process of 
passing the data through the model and backpropagating to readjust the weights is what 
comprises the training process of a model in deep learning. 

While the entire training process may sound complicated and computationally 
heavy, GPUs help train the models much quicker because they are optimized to perform 
the matrix calculations required by graphics processing. 

Now that you know more about what deep learning is and how artificial neural 
networks operate, a question might arise on why we should use deep learning for 
anomaly detection. 

First of all, thanks to the advancements in GPU technology, we can train deep 
learning models that are far deeper (many layers with lots of parameters) and on huge 
data sets. This in itself leads to incredible performances by the networks and allows the 
model to have much more powerful applications. 

Not only has this led to a diverse set of models that are each suited for different 
applications (image classification, video captioning, object detection, language 
translation, generative models that can summarize articles, etc.), but the models keep 
getting better and better at their respective tasks. 


83 


CHAPTER 3. INTRODUCTION TO DEEP LEARNING 


The models are also far more scalable than their traditional counterparts, since 
deep learning models don’t hit a plateau in training accuracy as the number of data 
entries increases, meaning we can apply deep learning models to massive volumes of 
data. This attribute of deep learning models pairs very well with the trend of big data in 
today’s society. 

In this chapter, you will look at applying deep learning models to classifying 
handwritten digits as an introduction to using two great, popular deep learning 
frameworks in Python: Keras, with a TensorFlow backend, and PyTorch. These 
frameworks help you create customized deep learning models in just a few dozen lines 
of code as opposed to creating them entirely from scratch. 

Keras is a high-level framework that lets you quickly create, train, and test powerful 
deep learning models while abstracting all of the little details away for you. PyTorch 
is more of a low-level framework, but it doesn’t carry with it the amount of syntax that 
TensorFlow (a much more popular deep learning framework) does. Compared to Keras, 
however, there are still more things that you must define since it’s no longer abstracted 
away for you. 

Using PyTorch over TensorFlow or vice-versa is more of a personal preference, but 
PyTorch is easier to pick up. Both offer very similar functionality, and if there are any 
functions that TensorFlow has that PyTorch doesn’t, you can still implement them using 
the PyTorch API. 

Another note to make is that TensorFlow has integrated Keras into its API, so if you 
want to use TensorFlow in the future, you can still build your models using tf. keras. 


Intro to Keras: A Simple Classifier Model 


Before you get started, it is recommended that you have the GPU version of TensorFlow 
installed along with all of its dependencies, including CUDA and cuDNN. While they are 
not necessarily required to train deep learning models, having a GPU helps to massively 
reduce training time. Both TensorFlow and PyTorch utilize CUDA and cuDNN to access 
the GPU while training, and Keras runs on top of TensorFlow. 

If you have any questions about Keras, feel free to refer to Appendix A to get a better 
understanding of how Keras works and of the functionality that it offers. 


84 


CHAPTER 3. INTRODUCTION TO DEEP LEARNING 


Here are the exact versions of the necessary Python 3 packages used: 


tensorflow-gpu version 1.10.0 
keras version 2.0.8 

torch version 0.4.1 (this is PyTorch) 
CUDA version 9.0.176 


cuDNN version 7.3.0.29 


You will create, train, and evaluate a deep learning architecture known as a 


convolutional neural network (CNN) in Keras using the MNIST data set. You don’t need 


to download this data set since it is included within TensorFlow. 
The MNIST data set, or the Modified National Institute of Standards and Technology 
data set, is a large collection of handwritten images used to train computer vision and 


image processing models such as the CNN. It is a common data set to start with and is 


basically like the “hello world” data set of computer vision. 


The data set contains 60,000 training images and 10,000 testing images of 


handwritten digits 0-9, each with a dimension of 28x28 pixels. 


First, import all the dependencies (Figure 3-14). 


import tensorflow as tf 

import keras 

from keras.datasets import mnist 
from keras.models import Sequential 


from keras.layers import Dense, Dropout, 
Flatten, Input 


from keras.layers import Conv2D, MaxPooling2D 


from keras import backend as K 


import numpy as np 





Figure 3-14. Importing the modules needed to create the model 


85 


CHAPTER 3. INTRODUCTION TO DEEP LEARNING 


Now define some variables that you will use later (Figure 3-15). 


batch size = 128 
n Classes. = 10 


i epochs = 15 


im row, im col = 28, 28 





Figure 3-15. Variables to use later 


One pass of the entire data set through the model is called an epoch. The batch 
size is how many data entries pass through the model in one iteration. In this case, the 
training data passes through the model 128 entries at a time until all of the entries have 
passed through, marking the end of one epoch. The number of classes is 10 to represent 
each of the 10 digits from 0-9. These variables are also known as hyperparameters, 
parameters that are set before the training process. 

Let’s create your training and testing data sets. One thing to note is that you can use 
data frames, arrays, matrices, etc. in Keras to serve as your data sets. Run the code in 
Figure 3-16. 


(xX Crainy VY Crain), (x est, YY test) = mists load data) 





Figure 3-16. Define the training and testing data sets 


You can use matplotlib to see what one of these images looks like. Run the code in 
Figure 3-17 and see the results in Figure 3-18. 


86 


CHAPTER 3. INTRODUCTION TO DEEP LEARNING 


import matplotlib.pyplot as plt 


Smatplotlib inline 


plt.imshow(x train[1], cmap='gray') 


pLt. Show () 





Figure 3-17. Importing matplotlib.pyplot to see what these training images 
look like 


import matplotlib.pyplot as plt 
tmatplotlib inline 


mM 
C13 


In 


pit.imshow(x train[i], cmap="gray') 
plt. show () 





0 5 10 15 7 


Figure 3-18. The output of running the code in figure 3-17 


You can enter anywhere from 0 to 59,999 to visualize a sample in x_train. 
Just looking at 10 examples of the digit 1, you can see there is plenty of variation in 


the data set (see Figure 3-19 and Figure 3-20). 


87 


CHAPTER 3. INTRODUCTION TO DEEP LEARNING 


plt.figure (figsize=(15,10) ) 


an range (0, y trainsshape|0])< 
if(y train[f] == 1 andi < 10): 


DLE SSsubpLOe (Zz; So, aL) 


plt.imshow(x train[f], cmap='gray') 


OlLU,2ticks (i ]) 
OLte.Vyuacks (LiL) 


1=i+1 





Figure 3-19. Code to generate a plot that shows some example images for a 
specific class 


88 


CHAPTER 3. INTRODUCTION TO DEEP LEARNING 


In [103]: fig = plt. figure (figsize=(15,10)) 


i=0o0 
for f in range(0, y train.shape[0)): 
if(y train(f) == 1 and i < 10): 

plt.subplot(2, 5, i+1) 
plt.imshow(x_train[f£), cmap="gray') 
pit.xticks([]) 
plt.yticks([{)) 
i=i+li 


pit. show () 


4 
i 
. 
' 


Figure 3-20. The output of running the code in Figure 3-19. Notice the amount of 
variation, as well as anomalous data that you would barely consider as numbers 








Now, extend the shape by a dimension. Right now, the dimensions of the training 
and testing sets are as shown in Figure 3-21 and Figure 3-22. 


Princ ("x Train? 1} \nx vests 17 \n".2tormat | 


x train,shape, x Lesl.shape, -)) 





Figure 3-21. Code to output the shapes of the training and testing data sets 
In [71]: print("x train: {}\nx test: {}\n". format ( 
x train.shape, x test.shape, )) 


8, 28) 


Figure 3-22. The output of running the code in Figure 3-21 


89 


CHAPTER 3. INTRODUCTION TO DEEP LEARNING 


For the purposes of training your model, you want to extend this shape to (60000, 28, 
28, 1) and (10000, 28, 28, 1). 

A property of images is that there are three dimensions for color images and two for 
erey scale images. Grey scale images are simply row x column since they don’t have color 
channels. Color images, on the other hand, can be formatted as row x column x channel 
or channel x row x column. For color images, the variable channel is 3 because you want 
to know the pixel values for red, green, and blue (RGB). 

In this case, it’s grey scale, so you don’t have to worry about the channel variable, but 
the following code will account for both cases if you end up using a data set with color 
such as the CIFAR-10 data set. CIFAR-10 is extremely similar to MNIST, but this time you 
are Classifying the 32x32 images based on labels such as cars, birds, ships, etc. and they 
are in color. Run the code in Figure 3-23. 


tf Kutmage data format() == "channels first": 


x Utain = x UCrain~reéshape (x Lrain.shape|(0];, 1, am row, 
im_ col) 


K test = x Test.1reshape (x test.shape (0), ip im row; 
im Col) 


input. shape = (ly, im row, am _-col) 


else: 


x Erain = x Crain.sreshape (x train.shape![0], am row, 
im Col, . 1) 


KX Lest = X_TéSt.reshape (x tTest.shapel(Ul], am row, am.col, 
1) 


input shape = (im_row, im _col, 1) 





Figure 3-23. Code to reshape the training and testing data sets depending 
on whether or not the channels are first, and then to define the input shape 
of the model 


Now convert the values to float32 and divide by 255. Right now, the values are all 
integer values that range from 0 to 255, but you want to convert those values to float and 
make them 0 to 1. This is a process called normalization, or feature scaling, where you 
attempt to rescale the data to smaller, more manageable values. In this example, you use 
a method called min-max normalization, defined by the formula in Figure 3-24. 


90 


CHAPTER 3 INTRODUCTION TO DEEP LEARNING 


; xX —Xmin 

= 
Xmax — Xmin 

Figure 3-24, Formula for min-max normalization 


Your values ranged from 0 to 255. For each value, you “subtracted” 0 from x, and 
divided by 255 - 0, which is just 255. Rescaling the pixel values from a range of [0, 255] to 
[O, 1] is common in image tasks and can be done with colored images as well. 

There are other methods, including mean normalization, standardization (z-score 
normalization), and unit length scaling. 

The formulas for each method are as follows: 

Mean normalization (Figure 3-25) 


' X — Xaverage 
— 
Xmax ~ *min 
Figure 3-25. Formula for mean normalization 
This formula is similar to min-max normalization, except you USE Xgverage in the 


numerator OVer X,,;7. 


Standardization (Figure 3-26) 


Figure 3-26. Formula for standardization 


You basically find z-score values for each x and use those instead of the original x values. 
Unit length scaling (Figure 3-27) 


XxX 


|| 


Figure 3-27. Formula for unit length scaling 


I 


=—x 
— 


91 


CHAPTER 3. INTRODUCTION TO DEEP LEARNING 


You find the unit vector for x and use that instead. Unit vectors have a magnitude of 1. 
The next block of code is shown in Figure 3-28. 


x Urain = = Crainadst yoo Tloet3zZ*) 
X Lest = X-lest.astype ("Lloatsz") 


x train /= 255 


x bese. f= 255 


Vrain = keras-utils.t6. categorically train, n Classes) 


Vy test. = Kerdgs.vtils.to cacegorical(y test, m classes) 





Figure 3-28. Converting x_train and x_test to float32 and applying min-max 
normalization by dividing by 255. For y trainandy test, you convert them toa 
one-hot encoded format 


What keras.utils.to_ categorical() does is take the vector of classes and create a 
binary class matrix of the number of classes. Assume that you have a vector representing 
y_ train with 6 classes at most, going from 0-5 (Figure 3-29). 


Data at index 0 
Data at index 1 


Data at index 2 


& Ue 


Data at index 3 Z 


Figure 3-29. A vector representing y_train that has 6 classes with values ranging 
from 0-5 


After running keras.utils.to categorical(y train, n classes) where 
n_ classes = 5, Figure 3-30 shows what you would now get for y_ train. 


o2 


CHAPTER 3. INTRODUCTION TO DEEP LEARNING 


Data atindexO | 910000 
Dataatindexl1 | OQQQOOL1 
Dataatindex2 | OQO0010 
Dataatindex3 | Q01000 


Figure 3-30. A one-hot encoded representation of the y_train vector in Figure 3-39 


The classes are still the same, but this time you have to get the class by their index 
and not by direct value. At index 1 (row 1 if you think of this as a matrix with 1 column) 
of the original vector, you see that the class label is 5. In your transformed y_train data 
(which is now a matrix), at row 1 (previously index 1 before the transformation), you see 
that everything is a 0 in the vector at that index except for the value at column 5. And so, 
y_ train is still 5 at index 1, but it’s formatted differently. 

Now let’s check the shapes of your transformed data in Figure 3-31 and Figure 3-32. 


print("x train: {}\nx test: {}\ninput shape: {}\n# of training 
samples: {}\n# of testing samples: {}".format ( 


xX Craimeshaps, X Lest. shape, input Shape, x Train.shape [0], 
xX ‘Léestl.shape[0]))) 





Figure 3-31. Print the shapes of the transformed data 


In [126]: print ("x train: {}\nx_test: {}\ninput_shape: {}\n \ 
# of training samples: {}\n# of testing samples: {}".format ( 
x train.shape, x_test.shape, input_shape, x _train.shape([0], x test.shape[0])) 


x train: (60000, 28, 28, 1) 
x test: (10000, 28, 28, 1) 
input shape: (26, 26, 1) 

# of training samples: 60000 
# of testing samples: 10000 


Figure 3-32. The resulting output 


93 


CHAPTER 3 INTRODUCTION TO DEEP LEARNING 


Note The \ character tells Python that you want to continue to the next line. 
Without it, the code would not run because Python doesn’t see the end of the string 
denoted by the second “, but what \ tells Python is to continue on the next line. 


Now you can move on to defining and compiling your model. 


Run the code in Figure 3-33. 


Figure 3-33. 


94 





model = Sequential () 

model..add{(ConvzD(s2, Kernel. Saze=(5, 3), 
activation='relu', 
input shape=input_ shape) ) 


model.add(Conv2D(64, (3, 3), 
activation='relu') ) 


model. add(MaxPoolingZD( pool. Ssize=(2; -2):)) 
model.add(Dropout (0.25) ) 
model.add(Flatten () ) 

model.add(Dense (128, activation='relu') ) 
model.add(Dropout (0.5) ) 


model.agd (Dense (n. classes, 
activation='softmax') ) 


model. compte loss=keras:losses.cavegorical. . 
CrossenlLropy, 


optimizer=keras.optimizers.Adam(), 


metrics=['accuracy']) 


model.summary () 


Code to define a deep learning model and add layers to it 


CHAPTER 3. INTRODUCTION TO DEEP LEARNING 


In Keras, the sequential model is a stack of layers. The Conv2D is a two-dimensional 
convolutional layer. 

In convolutional neural networks, a convolution layer filters through the data and 
multiplies each of the values element-wise by the weights in the filter and sums them 
up to generate one value. In this case, it’s a 3x3 filter that slides over each of the pixels to 
generate a smaller layer called an activation map or feature map. This feature map then 
has another filter applied to it in the second convolutional layer to generate another, 
smaller feature map. The weights that are optimized during backpropagation are found 
in the filter. To get a better idea of this, let’s look at some examples of how this works. 


Assume a 5x5 pixel picture like Figure 3-34. 





Figure 3-34, A 5x5 pixel picture, with 0 representing black pixels and I 
representing white pixels 


Assume also that your kernel size (filter dimensions) is 2x2. Figure 3-35 shows how 


the convolutions would go. 


95 


CHAPTER 3. INTRODUCTION TO DEEP LEARNING 






Nad ter Weight 
O*-0. eS + bs 
O5 


O*-0.5) + 





Figure 3-35. An example of one multiplication of the 2x2 filter on a 2x2 section 
of the input image. The filter weights are applied element-wise and produce an 
output value that is part of the feature map-the output of this convolutional layer 


To begin with, you have a random set of weights for the 2x2 filter, or kernel. 
The filter goes over the first 2x2 region in the image and sums the element-wise 
multiplication of the values in the filter and the values in the 2x2 region of the image. 
This value is the first element of the feature map, which is a 4x4 layer image. Given an 
nxn filter and mxm image, your feature map dimensions will be an m-n+1 x m-n+1 
dimensional image. In this case, your image is 5x5 and the kernel is 2x2, so the feature 
map is 5-2+1= 4x4 pixels. 

The filter goes through each region in the image pixel by pixel, as shown in 
Figure 3-36. 


96 


CHAPTER 3 INTRODUCTION TO DEEP LEARNING 


2a Mec Weight 





(0*0.5) 
_ 0.25) + 
*-0.5) + 
(1*0.7 







Feature Map 





Figure 3-36. After the operation in Figure 3-35, the filter moves to the next set of 
data to multiply over, producing the second value in the feature map 


The filter continues doing this until it reaches the right side of the image. After 
that, the filter goes one down and starts again from the left side of the image, like in 
Figure 3-37. 


97 


CHAPTER 3. INTRODUCTION TO DEEP LEARNING 





2x2 Filter Weights 
— SS = 
O 0 










ar 
(0*-0.5) + 0.75 
(1*0.75) 





Feature Map 


Eo 0.25 0.25 





Figure 3-37. Showing what happens after the filter reaches the right-most side of 
the image. It moves down one (in this case, at least; you can specify how much you 
want the filter to move as a parameter when calling this layer) and then continues 
its operations as usual 


From here, the filter continues moving right in a pixel by pixel fashion 


(see Figure 3-38). 


98 


CHAPTER 3. INTRODUCTION TO DEEP LEARNING 





Figure 3-38. The filter continues moving as normal, adding more values to the 
feature map 


Once it reaches the end, it goes back to the first column and down one row and 
continues its operations until it reaches the bottom right region (see Figure 3-39). 


re 


CHAPTER 3. INTRODUCTION TO DEEP LEARNING 





(1*-0.5) + 
(0*0.75) 





Figure 3-39. Once the filter reaches this value here, the convolution operation 
ceases, outputting a feature map to the next layer 


The feature map doesn’t make much sense due to the randomness of the weights. 

After the two convolutional layers, you run into the MaxPooling2D layer. Max 
pooling is where the input data is scanned by a filter, which in this case is a 2x2 filter, and 
the maximum value in the 2x2 region of the image is chosen to be the value in the new 
n-dimensional image. If the stride length is not given, by default Keras chooses the pool 
size. The stride length is how far the filter should shift, and it plays a role in determining 
the feature map size. In this case, since the stride length is 2 and the pooling filter size is 
also 2x2, the dimensions of the input data are reduced in half. 

Assume that the 4x4 image in Figure 3-40 is the input to a max pooling layer with 


pool size of 2x2. 


100 


CHAPTER 3. INTRODUCTION TO DEEP LEARNING 


Filter goes two to the right (stride length is 2) 






Pooling Layer 


Figure 3-40. What a max pooling operation looks like on a 4x4 image 


Since the pool size is 2x2 and the stride length is also 2 in this case (no parameter 
was provided for stride length), the pooling layer happens to split the entire image into 
regions of 2x2 pooling filters. 

If the stride length was 1, then you would have a situation similar to the convolution 
example you saw earlier, and the dimensions of the feature map would be 4-2+1 = 3x3. 
This process of pooling can also be referred to as downsampling. 

The pooling layer helps reduce the size of the data to allow for easier computation. 
Additionally, it can help with pattern identification because the maximum value in each 
region is selected, allowing for the patterns to stand out more. 

The dropout layer is next. Dropout is a regularization technique where a proportion 
(this is a parameter passed in) of randomly selected nodes are “dropped,” or ignored 
during the training process. 

Flatten is a layer where the entire input is squashed into one dimension. Assume 
that you are trying to flatten a 3x3 image, like Figure 3-41. 


101 


CHAPTER 3. INTRODUCTION TO DEEP LEARNING 


Input to flatten layer 





Flatten layer output 


Figure 3-41. Showing what a flatten layer does to an input 3x3 image 


The dense layer is simply a layer of regular nodes similar to those in the artificial 
neural network example. They perform in the same way, but in this case the number of 
nodes varies from 128 in the first dense layer and to 10 in the second dense layer. The 
activation function also changes, from ‘relu, or rectified linear unit (ReLU) in the first 
dense layer, to softmax in the second. 

Mathematically, the ReLU function is defined as y = max(0, x), so when the node 
calculates the dot products between the input and the weights and adds the bias, it 
simply outputs whatever is bigger between 0 or the calculation. 

The graph for ReLU looks like Figure 3-42. 


102 


CHAPTER 3. INTRODUCTION TO DEEP LEARNING 





y=xforallx>0O 


y=Oforallx<0O 


Figure 3-42. A graph showing the ReLU function 


The general formula for softmax is shown in Figure 3-43. 


x; 


O(x*)i= aR Fori=1...,K and x = (x, ..., x) € R* 
j=l ej 


Figure 3-43. Formula for the softmax activation function 


As for the optimizer, it is set to the Adam optimizer, a type of gradient-based 
optimizer. By default, the parameter known as the learning rate is set to 0.001. Recall 
that the learning rate helps determine the step size taken by the optimization algorithm 
to see how much to adjust the weights by. 

After executing the code in Figure 3-43, you get the output in Figure 3-44. 


103 


CHAPTER 3. INTRODUCTION TO DEEP LEARNING 


In [128]: 1 model = Sequential () 
2 model.add(Conv2D(32, kernel _size=(3, 3), 
3 activation='relu', 
input_shape=input_shape) ) 
model.add(Conv2D(64, (3, 3), activation=‘relu')) 
model .add(MaxPooling2D(pool size=(2, 2))) 
model .add (Dropout (0.25) ) 
2 model.add(Flatten ()) 
model.add(Dense(1i28, activation='relu')) 
10 model.add(Dropout (0.5) ) 
il model.add(Dense(n classes, activation='softmax")) 


iO —J ov wn 


model .compile (loss=keras.losses.categorical crossentropy, 
14 optimizer=keras.optimizers.Adam(), 
is metrics=["accuracy']) 


7 model.summary () 

Layer (type) «Output Shape = ss Param 
conv2d_17 (Conv2D) = (None, 26, 26, 32) = 320 
max pooling2d 6 (MaxPooling2 (None, 12, 12, 64) 0 
dropout 11 (Dropout) (None, 12, 12, 64) 0 
flatten 6 (Flatten) (None, 9216) 0 

dense 10 (Dense) = (None, 128) 1179776 | 
dropout 12 (Dropout) (None, 128) 0 

dense ii (Dense) (None, 10) 1290 


Total params: 1,199,882 
Trainable params: 1,199,882 
Non-trainable params: 0 


Figure 3-44. The output for the code in Figure 3-33. Note how it tells you the 
output shapes of each layer and the number of parameters; this can be useful when 
creating custom models and finding out that there is a mismatch between the 
dimensionality of what a layer expects and what it actually receives 


104 


CHAPTER 3. INTRODUCTION TO DEEP LEARNING 


Now let’s move on to training the data. Depending on your setup, this can take 
anywhere from a few seconds to several minutes. Without cuda, expect that this will take 
much longer. 

Run the code in Figure 3-45. 


checkpoint = ModelCheckpoint (filepath="keras MNIST CNN.h5", 
verbose=0O, 


Save. O6sU. On. y= True) 


model. fic (X: Grainy, Y crain, 
Dalch: S17e-baLlch size, 
epochs=n epochs, 


verbose=l, 


Validation datea=(xX Test, YY Test)» 


callbacks=[checkpoint] ) 


score = model,cvalualte (x. lest, y lest, verbose=0} 
print('Test loss:', score[Q]) 


print('Test accuracy:', score[1]) 





Figure 3-45. Code to train the model and print accuracy and loss values for the 
test set 


105 


CHAPTER 3. INTRODUCTION TO DEEP LEARNING 


The variable checkpoint will store the model in the same folder as this code with 
the name keras_MNIST_CNN.h5. If you don’t want to save the model, run the code in 
Figure 3-46 instead. 


model. TLu(x Crain, y train, 
batch. Size=bacch Size, 
epochs=n1 epochs, 


verbose=l, 


Validation dala=(x Test, y test)) 


score = modelsevaluate(x test, y test, verbose=0) 
print('Test loss:', score[0Q]) 


print ('Test accuracy:', score[1]) 





Figure 3-46. Run this code if you don’t want to save the model 


106 


CHAPTER 3. INTRODUCTION TO DEEP LEARNING 


If successful, you should see something like Figure 3-47. 


In [265]: 1 checkpoint = ModelCheckpoint (filepath="keras MNIST_CNN.h5", 
verbose=0, 
save_best_only=True) 


5 model.fit(x train, y train, 
é batch_size=batch_size, 
epochs=n_epochs, 
é verbose=1, 
9 validation data=(x test, y test), 
10 callbacks=[checkpoint]) 


12 score = model.evaluate(x test, y test, verbose=0) 
13 print('Test loss:', score[0]) 
i¢ print('Test accuracy:', score[1)) 


Train on 60000 samples, validate on 10000 samples 





Epech 1/15 

60000/60000 [(s==s=s=s=s=s=s=ss=sssseseseseseses=)] - 9s - loss: 0.2444 - acc: 0.9258 - val_loss: 0.0485 - val_acc: 0.9844 
Epech 2/15 

60000/60000 - 78 - loss: 0.0878 - acc: 0.9743 - val_loss: 0.0391 - val_acc: 0.9869 
Epoch 3/15 

60000/60000 - 78 - loss: 0.0625 - acc: 0.9815 - val_loss: 0.0330 - val_acc: 0.9893 
Epech 4/15 

60000/60000 - 7s - loss: 0.0531 - ace: 0.9844 - val_loss: 0.0322 - val_ace: 0.9896 
Epech 5/15 

60000/60000 - 78 - loss: 0.0449 - acc: 0.9865 - val_loss: 0.0329 - val_acc: 0.9889 
Epoch 6/15 

60000/60000 - 7s - loss: 0.0410 - ace: 0.9875 - val _ loss: 0.0281 - val _ ace: 0.9914 
Epoch 7/15 

60000/60000 - 7s - loss: 0.0345 - ace: 0.9889 - val_loss: 0.0283 - val_ace: 0.9910 
Epoch 8/15 

60000/60000 - 7s - loss: 0.0314 - acc: 0.9902 - val_loss: 0.0265 - val_acc: 0.9926 
Epech 9/15 

60000/60000 - 7s - loss: 0.0293 - acc: 0.9907 - val_loss: 0.0280 - val_acc: 0.9920 
Epech 10/15 

60000/60000 - 7s - loss: 0.0256 - acc: 0.9917 - val_loss: 0.0290 - val_acc: 0.9917 
Epech 11/15 

60000/60000 (=====s=========s=s=s=ss========) - 78 - loss: 0.0234 - acc: 0.9927 - val_loss: 0.0269 - val_acc: 0.99260.9 
Epoch 12/15 

60000/60000 (=====s==s==s=ss=ssssssssssssssss===] - 7s - loss: 0.0230 - acc: 0.9925 - val_loss: 0.0267 - val _ acc: 0.9923 
Epech 13/15 

60000/60000 (==s=s=s=s=s=s=ssssssssss======) - 73 - loss: 0.0197 - acc: 0.9933 - val_loss: 0.0299 - val_ace: 0.9916 
Epoch 14/15 

60000/60000 - 7s - loss: 0.0202 - ace: 0.9931 - val_loss: 0.0286 - val_acec: 0.9932 
Epech 15/15 

60000/60000 - 7s - loss: 0.0179 - acc: 0.9938 - val_loss: 0.0261 - val_acc: 0.9934 





Test loss: 0.026139157172246224 
Test accuracy: 0.9934 


Figure 3-47. The output of running the training function, accompanied by the loss 
and accuracy values for the test set 


Let’s check the AUC score for this. Run the code in Figure 3-48. 


107 


CHAPTER 3. INTRODUCTION TO DEEP LEARNING 


from Skleéarnsmstrics Ampere Loc auc Score 


preds. = model.predicr(s vest) 
auc. = DOC auc. SCOre (np.round.(preds), y test) 


Drant("AUC? {2326} ".tOrmat (auc). ) 





Figure 3-48. Code to generate the AUC score for this model based on the test set 


Basically, the variable predictions are a list of arrays with 10 elements, each containing 
the probability values for class predictions for each of the x_test data samples. 

To check the values for the predictions before doing np. round(), run the code in 
Figure 3-49 and see the results in Figure 3-50. 


preds = model.predict (x_test) 


print ("Predictions for x_test[0]: {}\n\nActual label for x_test[0]: 
{}\n".format(preds[0], y_test[0])) 


print ("Predictions for x_test[0] after rounding: 
{}\n". format (np. round (preds) [0]) ) 





Figure 3-49. Code to see what the predictions actually look like before rounding 
them 


108 


CHAPTER 3. INTRODUCTION TO DEEP LEARNING 


2 preds = model.predict (x test) 
print ("Predictions for x_test[0]: {}\n\nActual label for x test[0]: {}\n".format(preds[0}, y test[0))) 
print ("Predictions for x _test[0] after rounding: {}\n". format (np. round(preds) [0])) 


Predictions for x test[(0): [4.1195924e-19 4.888474le-14 1.1587565e-13 1.5126733e-13 1.3377293e-15 
7.9817291e-17 2.9398691e-23 1.0000000e+00 5.9716662e-15 1.5278325e-13) 


Actual label for x_test[0]: [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.) 


Predictions for x test[0) after rounding: [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.) 


Figure 3-50. The output for running the code in Figure 3-49 


The data values for the predictions for every other class besides the one it predicts 
correctly are so small that rounding them off is insignificant. The AUC score is shown in 
Figure 3-51. 


In [266]: from sklearn.metrics import roc auc score 
preds = model.predict(x test) 


auc = roc auc score(np.round(preds), 
print ("AUC: {:.2%}".format (auc) ) 


y test) 


2 


AUC: 99.64% 


Figure 3-51. The generated AUC score for the model. This is the output of running 
the code in Figure 3-48 


That’s a really good AUC score! This score indicates that this model is really good at 
identifying handwritten digits, provided they’re in a similar format to the MNIST data set 
you used during training. 

Referring back to the convolutional layers, let’s run some code to see what the feature 
maps look like after the first two convolutional layers compared to the original image. 

Run the code in Figure 3-52 and look at the output in Figure 3-53. 


109 


CHAPTER 3. INTRODUCTION TO DEEP LEARNING 


from keras import models 


layers = [layer.output for layer in model.layers[:4]] 
model layers = models.Model (inputs=model.input, outputs=layers) 


aCliveacitons — model layers: predicl(x train) 


= Dlt.frigure (f£igsize=(15, 10) 


ssubploc(tl, 3, 1) 

.title ("Original") 

.imshow(x train[7].reshape(28, 28), cmap='gray') 
sxtCacke t[ |) 


sVeucks (|) ) 


f in range(l, 3): 
PLltestbolor tl, 3, trl) 


plt.title("Convolutional layer %d" % £) 


layer activation = activations (tf ] 


plt.imshow(layer activation[7, :, =<, cmap='gray') 


pLesxtaicks ([ 1) 


plt.ycicks ([]) 


plt.show () 





Figure 3-52. Code to generate graphs of what the images look like at various 
stages of the model 


110 


CHAPTER 3. INTRODUCTION TO DEEP LEARNING 


In [259]: from keras import models 


layers = [layer.output for layer in model. layers([:4)] 
model layers = models.Model(inputs=medel.input, outputs=layers) 
activations = model layers.predict (x_train) 


fig = plt. figure (figsize=(15,10)) 


plt.subplot(1, 3, 1) 

plt.title ("Original") 
plt.imshow(x_train[7).reshape(2&, 28), cmap="gray'") 
plt.xticks([]) 

plt.yticks({]}) 


for f in range(l1, 3): 
plt.subplot(i, 3, £+1) 
plt.title ("Convolutional layer dd" % £) 
layer activation = activations[f] 
plt.imshow(layer activation|7, :, :, 0], cmap="gray") 
plt.xticks([]) 
plt.yticks([]) 


pit. show () 


Original Convolutional layer 1 Convolutional layer 2 











Figure 3-53. The output of running the code in Figure 3-52 


As the image passes through the convolutional layers, its dimensions get reduced 
and the patterns become more apparent. While to us that might not look so much 
like a three, the model identifies those patterns from the original image and bases its 
prediction on that. 

So now you have a much better understanding of what a CNN is and how Keras can 
be used to easily create and train your very own deep neural network. If you would like to 
explore the framework further, feel free to check out Appendix A. If you have any further 
questions, or would like to explore Keras beyond what’s in Appendix A, check out the 


official Keras documentation. 


Intro to PyTorch: A Simple Classifier Model 


Now that you have a better idea of what a CNN is and how a classifier model looks like in 


Keras, let’s jump straight into implementing a CNN in PyTorch. 


111 


CHAPTER 3 INTRODUCTION TO DEEP LEARNING 


PyTorch doesn’t abstract everything to the extent that Keras does, so there’s a bit 
more syntax involved. If you would like to explore this framework further, check out 
Appendix B, where we cover the basics of PyTorch, its functionality, and apply it to the 
models that you will explore in Chapter 7. 

Just like in Keras, however, you start by importing the necessary modules and 
defining your hyperparameters (Figure 3-54). 


import torch 

import torch.nn as nn 

import torchvision 

import torchvision.transforms as transforms 
import torch.optim as optim 

import torch.nn.functional as F 


import numpy as np 


#If cuda device exists, use that. If not, default to CPU. 


device = Torch.device(’cuda:0” at torch.cuda.1s dvarilable() else "cou*) 





Figure 3-54. Code to import the modules you need and to define the device 
(CPU or GPU) to run PyTorch on 


In PyTorch, you must specify to torch that you want to use the GPU if it exists. In 
Keras, since you are using tensorflow-gpu as the back end (what Keras runs on top of), it 
is expected that you have a GPU, CUDA, and cuDNN installed. 

Now configure your hyperparameters (Figure 3-55). 


112 


CHAPTER 3. INTRODUCTION TO DEEP LEARNING 


#Hyperparameters 


num Spocns: = 15 


num classes = 10 


Davrch. size =. 126 


learning race = "0.001 





Figure 3-55. Code to define the hyperparameters to use 


In this example, you will match the model architecture used in the example for Keras 
as best as PyTorch allows you to. Not every function is equivalent between TensorFlow 
and PyTorch, but the vast majority of them are. 

Now create your testing and training data sets (Figure 3-56). 


#Load MNIST data set 

train dataset = torchvision.datasets.MNIST (root='../../data/', 
train=True, 
transform=transforms.ToTensor(), 


download=True) 
Lest Cacaeset TOPCHV1S10N, dacasete«MNIGT (LOCUS. «/ss/data/", 
train=False, 


transform=transforms.ToTensor () ) 


#Data loader 


teaim loader vorch.~utils.date.~Datlaloader (datasct=train dataser, 


batch .s176e=bacch size, 


shuffle=True) 


test loader = torch.utils.data.DataLoader(dataset=test dataset, 
batch size=bacch Size, 


shuffle=False) 





Figure 3-56. Using DataLoaders, a feature of PyTorch, to get the training and 
testing data 
113 


CHAPTER 3. INTRODUCTION TO DEEP LEARNING 


The procedure for loading the MNIST data might be a bit different in PyTorch, using 
data loaders instead of data frames, but you can still use data frames, arrays, and so on 
in PyTorch after converting them to tensors. The procedure is usually to convert the data 
frame to a numpy array and then to a PyTorch tensor. 

Let’s move on to creating your model (Figure 3-57). 


class CNN(nn.Module): 
def i =init (self): 
SUpPer(CNN, Selt). init () 
selrt.convil = nn«ConvZdil, 32; 3; 
self.conv2 nn.ConvzZd(32, 64, 3, 1) 
self.densel nn.Linear (12*12*64, 128) 


Seli .densez nn.Linear (128, num classes) 


forward(self, x): 

xX F.relu(self.convl (x) ) 

xX F.relu(self.conv2 (x) ) 
F.max poolzZd(x, 2, 2) 
F.GQrop0ul (x, 0.25) 
X.view(-1, 12%*12%*64) 
P.relu(selt.densel (x) ) 

xX F.Qropouci(x, 0.5) 


x self.dense? (x) 


return F.log softmax(x, dim=1) 





Figure 3-57. Creation of a convolutional neural network in PyTorch 


114 


CHAPTER 3. INTRODUCTION TO DEEP LEARNING 


The procedure is a bit different than in Keras. In this example, the major layers were 
defined under _init__, which are your two convolutional layers and the two dense 
layers. The rest of the layers are defined under forward(). In forward(), you set x equal 
to the output of the activation function of the first convolutional layer. This new x is 
now the input of the next convolutional layer, and you set x equal to the output of the 
activation function of the second convolution layer. This same process repeats for the 
other layers, but the exact flow of data can be a bit confusing, so Figure 3-58 shows an 
example of what this code actually does. 


xX = F.relu(self.convl (x) ) 








Figure 3-58. Erelu is f(x), x is the training data, and self.conv1 is the first 
convolutional layer 


The original inputs of x, self.convl, and Erelu can be shown as such. x passes into the 
convolutional layer, and the outputs of that layer pass through the ReLU function. Then 
you get your final output X’ (Figure 3-59). 


115 


CHAPTER 3. INTRODUCTION TO DEEP LEARNING 


Xx = F.relu(self.convl (x) ) 





x (training data) _ 


X’ is the x left of the equal sign, 
X_train Is x in self.conv1(x), 

f() is E.relu(), and Conv1. is 
self.conv() 


Figure 3-59. The outputs of f(x) are now the new x. Basically, x = f(x). In this case, 
the output x’ is the new x 


Now, X is X; and this new X gets passed onto the next layer (Figure 3-60). 
The new x we use in the next layer 


xX = F.relu(self.convl(x)) x = F.relu(self.conv2 (x) ) 


x (training data) X’ (the new x) 


“ 


f Convi —+X f Conv2 ——»> X” 


Figure 3-60. The new x is now the new input for the next convolutional layer 


116 


CHAPTER 3. INTRODUCTION TO DEEP LEARNING 


The same process repeats again, except with the new value of X (Figure 3-61). 


xX = F.relu(self.conv2 (x) ) 





- ¥’ (the new x) 


Conv2 x” 





Figure 3-61. The same process repeats, leading to a new value for x 


And now you get the new output X’ (Figure 3-62). 


xX = F.relu(self.conv2 (x) ) 








t Conv2 Xx" 


Figure 3-62. Once again, you redefine x as x". The process then continues for as 
many layers as you have in the network 


This new output X” is then the new value of X, and the process continues. 


117 


CHAPTER 3. INTRODUCTION TO DEEP LEARNING 


This is the same logic behind the rest of the code, where the output of the activation 
layer for the old is now the new definition of x. This new x then goes to the next layer, 
where a function is applied after it goes through a layer and then that data becomes the 
new definition of x, and so on. 

SO 


X = X.View(-1, 12*12*64) 


performs the same function as the flatten layer in the Keras example. 
Now you can move on to training your data (Figure 3-63). 


model = CNN().to (device) 
criterion = nn.CrossEntropyLoss () 


optimizer torch.optim.Adam(model.parameters(), lr=learning rate) 


total step = len(train. loader) 
for epoch in range(num_ epochs): 
for i, (images, labels) im enumerate(train loader): 
images = images.to (device) 


labels labels.to (device) 


# Forward pass 
outputs = model (images) 


loss = criterion(outputs, labels) 


# Backward and optimize 


OpLIMLZer. zero Orad() 


loss.backward () 


optimizer.step () 


Z£ (i+1) % 100 == 
Print. ("Epoch [{}/i}1. Step [tiv itl,. Doss: 42.42)" 


sEOrmhat (epoch+l, mum. epochs, itl, Total step, 
loss.item())) 





Figure 3-63. You initialize the model, your loss function, and your optimizer, and 
then you start the training process 


118 


CHAPTER 3. INTRODUCTION TO DEEP LEARNING 


It might take a while, but you should see something like Figure 3-64. 


In [85]: 1 model = CNN().to(device) 
2 criterion = nn.CrossEntropyLoss() 
3 optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate) 


5 total_step = len(train_loader) 
6 for epoch in range(num_epochs): 
7 for i, (images, labels) in enumerate(train_ loader): 
= images = images.to(device) 
C labels = labels.to(device) 


11 # Forvard pass 
12 outputs = model (images) 
] loss = criterion(outputs, labels) 


15 # Backvard and optimize 
16 optimizer.zero grad() 
17 loss.backward() 
optimizer.step() 


oO oD 


20 if (i+1) % 100 == 0: 
21 print ("Epoch [{}/{}], Step [{}/{}], Loss: {:.4£}' 
22 -format(epoch+i, num_epochs, i+i, total_step, loss.item())) 


Epoch [1/15], Step [100/469], Loss: 0.1666 
Epoch [1/15], Step [200/469], Loss: 0.2753 
Epoch [1/15], Step [300/469], Loss: 0.2462 
Epoch [1/15], Step [400/469], Loss: 0.1169 
Epoch [2/15], Step [100/469], Loss: 0.0327 
Epoch [2/15], Step [200/469], Loss: 0.0238 
Epoch [2/15], Step [300/469], Loss: 0.0293 
Epoch [2/15], Step [400/469], Loss: 0.0598 
Epoch [3/15], Step [100/469], Loss: 0.0179 
Epoch [3/15], Step [200/469], Loss: 0.0577 
Epoch [3/15], Step [300/469], Loss: 0.0275 
Epoch [3/15], Step [400/469], Loss: 0.0228 
Epoch [4/15], Step [100/469], Loss: 0.0051 
Epoch [4/15], Step [200/469], Loss: 0.0139 
Epoch [4/15], Step [300/469], Loss: 0.0048 
Epoch [4/15], Step [400/469], Loss: 0.0033 
Epoch [5/15], Step [100/469], Loss: 0.0081 
Epoch [5/15], Step [200/469], Loss: 0.0044 
Epoch [5/15], Step [300/469], Loss: 0.0084 
Epoch [5/15], Step [400/469], Loss: 0.0011 
Epoch [6/15], Step [100/469], Loss: 0.0077 


Figure 3-64. The output of the training process 


After training is done, you can test your model and find the AUC score (Figure 3-65). 


119 


CHAPTER 3. INTRODUCTION TO DEEP LEARNING 


from sklearn.metrics import roc auc score 


preds = [] 
y true = [] 
# Test the model 
model.eval () # Set model to evaluation mode. 
with torch.no grad({) : 
correct = 0 
total = 0 
for amages; labels in. test loader: 
images images.to (device) 
labels labels.to (device) 


outputs = model (images) 


_»y Predicted = Torch.stax (OulpulSs.data, 1) 


total += labels.size(0) 
correct += (predicted == labels) .sum().item() 
detached pred = predicted.derach ().~cpout) .numpy () 
detached label = .abels.detachn()«cpu().niumpy () 
for f in range(0, len(detached pred)): 

DPreds. append (detached. predlt]) 


y true.appendidetacned label (tl) 


print('Test Accuracy of the model on the 10000 test images: 
{+.22)}'.format (correct total) ) 
preds = np.eye(num classes) [preds] 
y true = np.sye (num classes) |y true] 
auc = roc auc sCore(preds;, y true) 
Princ ("AUCs {2.25} "elormat (aic))) 
# Save the model checkpoint 


Corch. save (model, stave Gictl(), 'pyterch mnist cnn.ckpr*) 





Figure 3-65. Code to evaluate the model and generate the AUC score 


120 


CHAPTER 3 INTRODUCTION TO DEEP LEARNING 


The resulting output is shown in Figure 3-66. 


i ee 
Epoch [15/15], Step [300/469], Loss: 0.0131 
Epoch [15/15], Step [400/469], Loss: 0.0002 


In [87]: 1 from sklearn.metrics import roc auc score 


preds = [] 
y true = [) 
# Test the model 
model.eval() # Set model to evaluation mode. 
with torch.no grad(): 
correct = 0 
tetal = 0 
for images, labels in test loader: 
images = images.to (device) 
labels = labels.to (device) 
outputs = model (images) 
_, predicted = torch.max(outputs.data, 1) 
total += labels.size([(0) 
1é correct += (predicted == labels) .sum{).item() 
17 detached pred = predicted.detach().cpu() .numpy() 
detached label = labels.detach() .cpu() .numpy() 
for £ in range(0, len(detached pred)): 
preds. append (detached_pred[f]) 
y_true.append (detached label[f]) 


print('Test Accuracy of the model on the 10000 test images: {:.2%}'.format(correct / total)) 
preds = np.eye(num classes) [preds] 
y_true = np.eye (num_classes) [y_true] 
auc = rec auc score(preds, y true) 
print ("AUC: {:.2%}".format (auc) ) 
# Save the model checkpoint 
terch.save (model.state_ dict(), "pytorch_mnist_cnn.ckpt") 
Test Accuracy of the model on the 10000 test images: 99.07% 
AUC: 99.48% 


Figure 3-66. The generated accuracy score on the test set and the AUC score for 
the model 


Now you a bit more about how to create and train your own CNN in PyTorch. 
PyTorch is a bit harder to learn than Keras, which aims to make everything quite 
readable and simple, having abstracted all of the more complicated bits of code. 
TensorFlow and PyTorch are both low-level APIs that require more code to be written 
because of the lack of abstraction, but offer more flexibility in controlling exactly how 
you want everything to be. Between the two, PyTorch is easier to debug if you're using 
the debugging tool in PyCharm. In the end, it’s all a matter of preference, although 


TensorFlow and PyTorch both perform faster on larger data sets. 


121 


CHAPTER 3. INTRODUCTION TO DEEP LEARNING 


If you would like to explore PyTorch further, check out Appendix B, where we cover 
a more refined way to create models, train, and test, as well as the general functionality 
that PyTorch offers. Appendix B also applies PyTorch to the models in Chapter 7, which 
are done in Keras. 

If you would like to learn more about PyTorch after visiting Appendix B, check out 
the official PyTorch documentation. 


Summary 


In recent years, deep learning has revolutionized an incredible variety of fields. Thanks 
to deep learning, we now have self-driving cars, models that have beaten professionals in 
detecting certain cancers, instant translation between languages, etc. It is of no surprise, 
then, that deep learning has also contributed heavily to the field of anomaly detection. 

In this chapter, we discussed what deep learning is and what an artificial neural 
network is. You explored two popular frameworks, Keras and PyTorch, by applying them 
to the task of image classification with the MNIST data set. 

In the upcoming chapters, we will take a look at the applications to anomaly 
detection of the following types of deep learning models: autoencoders, restricted 
Boltzmann machines, RNN/LSTM networks, and temporal convolutional networks. 

In the next chapter, we will look at unsupervised anomaly detection with 
autoencoders. 


122 


CHAPTER 4 


Autoencoders 


In this chapter, you will learn about autoencoder neural networks and the different types 
of autoencoders. You will also learn how autoencoders can be used to detect anomalies 
and how you can implement anomaly detection using autoencoders. 

In a nutshell, the following topics will be covered throughout this chapter: 


e What are autoencoders? 

e Simple autoencoders 

e Sparse autoencoders 

e Deep autoencoders 

e Convolutional autoencoders 
e Denoising autoencoders 


e Variational autoencoders 


What Are Autoencoders? 


In the previous chapter, you learned about the basic functioning of a neural network. 
The basic concept is that a neural network essentially computes a weighted calculation 
of inputs to produce outputs. The inputs are in the input layer and the outputs are in 

the output layer and there are one or more hidden layers between the input and output 
layers. Back propagation is a technique used to train the network while trying to adjust 
the weights until the error is minimized. Autoencoders use this property of a neural 
network in a special way to accomplish some very efficient methods of training networks 
to learn normal behavior, thus helping to detect anomalies when they occur. Figure 4-1 
shows a typical neural network. 


123 
© Sridhar Alla, Suman Kalyan Adari 2019 


S. Alla and S. K. Adari, Beginning Anomaly Detection Using Python-Based Deep Learning, 
https://doi.org/10.1007/978-1-4842-5177-5_4 


CHAPTER 4 AUTOENCODERS 


input Layer 


Hidden Layer 





Output Layer 


input Data Pants 
sINAGINO 


Weignts Wij 


Figure 4-1. A typical neural network 


Autoencoders are neural networks that have the ability to discover low-dimensional 
representations of high-dimensional data and are able to reconstruct the input from the 
output. Autoencoders are made up of two pieces of the neural network, an encoder and 
a decoder. The encoder reduces the dimensionality of a high dimensional dataset to a 
low dimensional one whereas a decoder essentially expands the low-dimensional data 
to high-dimensional data. The goal of such a process is to try to reconstruct the original 
input. If the neural network is good, then there is a good chance of reconstructing the 
original input from the encoded data. This inherent principle is critical in building an 
anomaly detection module. 

Note that autoencoders are not that great if you have training samples containing 
few dimensions/features at each input point. Autoencoders perform well for five or more 
dimensions. If you have just one dimension/feature then, as you can imagine, you are 


just doing a linear transformation, which is not useful. 


124 


CHAPTER 4 AUTOENCODERS 


Autoencoders are incredibly useful in many use cases. Some popular applications of 
autoencoders are 


1. Training deep learning networks 
2. Compression 

3. Classification 

4. Anomaly detection 


5. Generative models 


Simple Autoencoders 


Of course, we will focus on the anomaly detection piece in this chapter. Now, an 
autoencoder neural network is actually a pair of two connected sub-networks, an 
encoder and a decoder. An encoder network takes in an input and converts it into a 
smaller, dense representation, also known as a latent representation of the input, which 
the decoder network can then use to convert it back to the original input as much as 
possible. Figure 4-2 shows an example of an autoencoder with encoder and decoder 
sub-networks. 


Autoencoder 


Encoder —————* |__Latent/Compressed Decoder 
Representation a 


Figure 4-2. A depiction of an autoencoder 


Autoencoders use data compression logic where the compression and 
decompression functions implemented by the neural networks are lossy and are mostly 
unsupervised without much intervention. Figure 4-3 shows an expanded view of an 
autoencoder. 


125 


CHAPTER 4 AUTOENCODERS 


Input Layer 
put Layer Output Layer 


Hidden Layer 





Inpul Data Points 
sinding 





Weights Wij 
Vreights Wyk 


Encoder Network | | 


Decoder Network 


Figure 4-3. Expanded view of an autoencoder 


The entire network is usually trained as a whole. The loss function is usually either 
the mean-squared error or cross-entropy between the output and the input, known as 
the reconstruction loss, which penalizes the network for creating outputs different from 
the input. Since the encoding (which is simply the output of the hidden layer in 
the middle) has far less units than the input, the encoder must choose to discard 
information. The encoder learns to preserve as much of the relevant information as 
possible in the limited encoding and intelligently discards the irrelevant parts. The 
decoder learns to take the encoding and properly reconstruct it back into the input. If 
you are processing images, then the output is an image. If the input is an audio file, the 
output is an audio file. If the input is some feature engineered dataset, the output will be 
a dataset too. We will use a credit card transaction sample to illustrate autoencoders in 
this chapter. 


126 


CHAPTER 4 AUTOENCODERS 


Why do we even bother learning the presentation of the original input only to 
reconstruct the output as well as possible? The answer is that when we have input with 
many features, generating a compressed representation via the hidden layers of the 
neural network could help in compressing the input of the training sample. So when the 
neural network goes through all the training data and fine tunes the weights of all the 
hidden layer nodes, what will happen is that the weights will truly represent the kind of 
input that we typically see. As a result of this, if we try to input some other type of data, 
such as having data with some noise, the autoencoder network will be able to detect the 
noise and remove at least some portion of the noise when generating the output. This is 
truly fantastic because now we can potentially remove noise from, for example, images 
of cats and dogs. Another example is when security monitoring cameras capture hazy 
unclear pictures, maybe in the dark or during adverse weather, causing noisy images. 

The logic behind the denoising autoencoder that if we have trained our encoder 
on good, normal images and the noise when it comes as part of the input is not really a 
salient characteristic, it is possible to detect and remove such noise. 

Figure 4-4 shows the basic code to import all necessary packages in a Jupyter 
notebook. Note the versions of the various packages. 


127 


CHAPTER 4 AUTOENCODERS 


import keras 

from keras import optimizers 

from keras import losses 

from keras.models import Sequential, Model 

from keras.layers import Dense, Input, Dropout, Embedding, LST 
from keras.optimizers import RMSprop, Adam, Nadam 

from keras.preprocessing import sequence 

from keras.callbacks import TensorBoard 


import sklearn 

from sklearn.preprocessing import StandardScaler 

from sklearn.model_selection import train_test_split 

from sklearn.metrics import confusion_matrix, roc_auc_score 
from sklearn.preprocessing import MinMaxScaler 


import seaborn as 5ns 
import pandas as pd 
import numpy as np 
import matplotlib 


import matplotlib.pyplot as plt 
import matplotlib.gridspec as gridspec 
mmatplotlib inline 


import tensorflow 
import sys 
print("Python: ", sys.version) 


print("pandas: ", pd. version_) 
print("“numpy: ", np.__version_) 
print("seaborn: ", sns.__version_) 
print("“matplotlib: ", matplotlib._version_) 
print("sklearn: ", sklearn._version_ 
print("Keras: ", keras.__version__) 
print("Tensorflow: ", tensorflow._version_) 


Using TensorFlow backend. 


Python: 3.7.1 (default, Dec 10 2018, 22:54:23) [MSC v.1915 64 bit (AMD64)] 
pandas: @.24.2 

numpy: 1.16.3 

seaborn: 2.9.0 

matplotlib: 3.0.3 

sklearn: 0.20.3 

Keras: 2.2.4 

Tensorflow: 1.13.1 


Figure 4-4. Importing packages in a Jupyter notebook 


Figure 4-5 shows the code to visualize the results via a confusion matrix, a chart for 
the anomalies and a chart for the errors (the difference between predicted and truth) 
while training. It shows the Visualization helper class. 


128 


CHAPTER 4 AUTOENCODERS 


class Visualization: 
labels = [“Normal", “Anomaly"] 


def draw_confusion_matrix(self, y, ypred): 
matrix = confusion_matrix(y, ypred) 


plt.figure(figsize-(10, 8)) 

colors=-[ “orange”, "green" ] 

sns.heatmap(matrix, xticklabels<self.labels, yticklabels=self.labels, cmap=colors, annot=-True, fmt="d") 
plt.title("Confusion Matrix”) 

plt.ylabel( 'Actual") 

plt.xlabel( 'Predicted') 

plt.show() 


def draw_anomaly(self, y, error, threshold): 
groupsOF = pd.DataFrame({‘error': error, 
"true": y}).groupby("true*) 


figure, axes = plt.subplots(figsize=(12, &)) 


for name, group in groupsDF: 
axes.plot(group.index, group.error, marker='x' if name == 1 else ‘o', linestyle-'', 
color='r‘' if name == 1 else ‘g', label="Anomaly” if name == 1 else "Normal") 


axes.hlines(threshold, axes.get_xlim()[6], axes.get_xlim()[1], colors="b", zorder=106, label="Threshold') 
axes. legend( ) 


plt.title(“Anomalies") 
plt.ylabel("Error”) 
plt.xlabel("Data") 
plt.show() 


def draw_error(self, error, threshold): 
plt.plot(error, marker="0", m5=-3.5, linestyle=-"", 
label="Point") 
plt.hlines(threshold, xmin=@, »max=len(error)-1, colors="b", zorder=106, label='Threshold') 
plt.legend() 
plt.title("Reconstruction error”) 
plt.ylabel("Error”) 
plt.xlabel("Data") 
plt.show() 


Figure 4-5. Visualization helpers 


You will use the example of credit card data to detect whether a transaction is 
normal/expected or abnormal/anomaly. Figure 4-6 shows the data being loaded into a 
Pandas dataframe. 


129 


CHAPTER 4 AUTOENCODERS 


filePath = ‘./creditcardanomalydetection.csv" 
df = pd.read_csv(filepath_or_buffer-filePath, header-0, sep=',") 
print(df.shape[@]) 


df. head( ) 
284567 

Time v1 v2 W2 Va VS v6 vi ve V3... V24t W22 W23 Vid v2 
0 OO -1.350007 -0<:.072731 253047 1.378155 -0.330321 0.402388 O230500 0.008608 O.363787 .. 0.018307 O2778 ~O.110474 0.000028 0.12853 
1 O08 T.1R18S7 0.260151 O100480 0448134 0.000018 -0.082301 ~DOTEEOS 0.085102 02550425 .. -O.225775 -O.638872 0.101288 -O.300848 = O.16717 
2 1.0 -1.358354 -1.340163 1.773200 O.370780 -0.503108 1.200400 O701481 O247678 -1514654 .. O247008 OCTTIGTS O.000412 -0.650281 -0.32764 
3 10 OR66Z72 0.195228 1.702003 -0.863201 -0.010300 1247203 O237609 O.377438 -1.387024 .. -0.108300 0.005274 ~.100321 -1.175575 0.64737 
4 2.0 -1.158233 O.877737 1.848718 0.403034 -O.407103 0.005021 0.502041 -0.270533 O.817730 .. -0.000431 O.708278 +~0.137458 0.141267 -0.20601 


2 fows * 31 columns 
Figure 4-6. Examining the Pandas dataframe 


You will collect 20k normal and 400 abnormal records. You can pick different ratios 
to try, but in general more normal data examples are better because you want to teach 
your autoencoder what normal data looks like. Too much abnormal data in training 
will train the autoencoder to learn that the anomalies are actually normal, which goes 
against your goal. Figure 4-7 shows sampling the dataframe and choosing the majority of 


normal data. 


df['amount'] = Standardscaler().fit_transform(df['amount'].values.reshape(-1, 1)) 
df@ = df.query('Class == 6').sample(2e0ee) 

df1 = df.query('Class == 1').sample(4ee) 

df = pd.concat([dfté, dfi]}) 


Figure 4-7. Sampling the dataframe and choosing the majority of normal data 


You split the dataframe into training and testing data sets (80-20 split). Figure 4-8 
shows the code to split the data into the train and test subsets. 


x train, x_test, y_train, y_test = train_test_split(df.drop(labels=['Time’, "Class'], axis = 1) , 
df['Class'], test_size-@.2, random_state-42) 

print(x_train.shape, ‘train samples") 

print(x_test.shape, ‘test samples’) 


(16328, 29) train samples 
(4080, 29) test samples 


Figure 4-8. Spliting the data into test and train sets, using 20% as holdout 
test data 


130 


CHAPTER 4 AUTOENCODERS 


Now it’s time to create a simple neural network model with just an encoder and 
decoder layer. You will encode the 29 columns of the input credit card dataset into 
12 features using the encoder. The decoder expands the 12 back into the 29 features. 


Figure 4-9 shows the code to create the neural network. 


encoding dim = 12 
input_dim = x_train. shape[i] 


inputArray = Input(shape=(input_dim, )) 
encoded = Dense(encoding dim, activation="relu*)(inputaArray) 


decoded = Dense(input_dim, activation-'softmax')(encoded) 


autoencoder = Model(inputéArray, decoded) 
autoencoder. summary () 


WARNING: tensorflow:From C:\ProgramData\Anaconda3\lib\site-packages\tensorflow\python\framework\\op_def_library.py:263: colocate_ 
with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version. 

Instructions for updating: 

Ccolocations handled automatically by placer. 


Layer (type) § § | Output Shape #8 Params 
laud 7 aan —_ —— sssscscas — 
dense_2 (Dense) (None, 25) 377 


Total params: 737 
Trainable params: 737 
Non-trainable params: 6 


Figure 4-9. Creating the simple autoencoder neural network 


If you look at the code in Figure 4-9, you will see two different activation functions, 
namely relu and softmax. So what are they? 

RELU, the Rectified Linear Unit, is the most commonly used activation function in 
deep learning models. The function returns 0 if it receives any negative input, but for any 


positive value xx it returns that value back. So it can be written as 
f(x)=max(0,x). 


Softmax, the Softmax function, outputs a vector that represents the probability 
distributions of a list of potential outcomes. The probabilities always add up to 1. 

Needless to say, there are several activation functions available and you can refer to 
the Keras documentation to look at the options at https: //keras.io/activations/. 

Now, compile the model using RMSprop as the optimizer and mean squared error 
for the loss computation. The RMSprop optimizer is similar to the gradient descent 
algorithm with momentum. A metric function is similar to a loss function, except that 


the results from evaluating a metric are not used when training the model. You may use 


131 


CHAPTER 4 AUTOENCODERS 


any of the loss functions as a metric function, as listed in https: //keras.io/losses/. 
Figure 4-10 shows the code to compile the model using mean absolute error and 


aCCUuracy as metrics. 


autoencoder.compile(optimizer=RMSprop(), 
loss='mean_squared error’, 
metrics=[‘mae’, ‘accuracy’ ]) 


Figure 4-10. Compiling the model 


Now you can start training the model using the training dataset to validate the model 
at every step. Choose 32 as the batchsize and 20 epochs. Figure 4-11 shows the code to 
train the model, which is the most time consuming part of the process. 


batch_size = 32 
epochs = 28 
history = autoencoder.fit(x_train, x_train, 
batch_size=batch_size, 
epochs=epochs, 
verbose=1, 
validation_data=(x_test, x_test), 
callbacks=(Tensoréoard(log dir='../logs/autoencoder1'))) 


Figure 4-11. Training the model 


As you see, the training process outputs the loss and accuracy, as well as the 
validation loss and validation accuracy at each epoch. Figure 4-12 shows the output of 


the training step. 


132 


Epoch 3/20 

16326/16326 [ semeeeeeeeeeeeeeeeeeeeeceesces | - 25 1@6us/step - 1055: 
O55: 1.6319 - val_mean_absolute_error: 6.6643 - val_acc: 6.6525 
Epoch 4/20 

16326/16328 [ Seeeeeeee cess eeeeeeeseseesess | - 26 1igéus/step - 1055: 
O55: 1.6290 - val_mean_absolute_error: 6.6636 - val_acc: 6.6843 
Epoch 5/26 

16326/16328 etrrtretierriier trip perio gy - 2s 187us/step - 1055: 
OSS: 1.6275 - val_mean_absolute error: @.6633 - val_acc: @.7071 
Epoch 6/2e 

163276/16328 [ semeeeeeseessesesseeseeseesess | = 25 1esus/step - 16s: 
O55: 1.6266 - val_mean_absolute_error: 6.6631 - val_acc: 6.7206 
Epoch 7/20 

16320/16320 [ Seeeeeeeeeeeeeeeeseeseeseesees } = 25 1@6us/step - 1055: 
OSS: 1.6258 - val_mean_absolute_error: 6.6636 - val_acc: 6.7373 
Epoch 8/26 

16326/16328 etrrtretierrrirrirrrrr perry = 25 116us/step - loss: 
O55: 1.6253 - val_mean_absolute_error: @.6636 - val_acc: 6.7482 
Epoch 9/2@ 

163276/16328 [seeeeeeeseesseseesesseesessess ] =~ 25 1i2us/step - 1665: 
OSS: 1.6249 - val_mean_absolute_error: @.6629 - val_acc: @.7593 
Epoch 16/20 

163278/163278 etrrrretirrriiertrerrr terior gs =» 2S 115us/step - 1o55: 
OSS! 1.6246 - val_mean_absolute_error: 6.6629 - val_acc: 6.7689 
Epoch 11/20 

16326/16328 etrrtretierrrirrirrrrr rere = 2s 113us/step - 1055: 
OSS: 1.6244 - val_mean_absolute_error: ©.6628 - val_acc: 6.7691 
Epoch 12/28 

16326/16328 [seeeeeeeeeeeseesesesssssessess | = 25 li4us/step - 1055: 
055: 1.6242 - val_mean_absolute_error: 6.6628 - val_acc: 6.7723 
Epoch 13/228 

163270/16328 Geerrrrrrterrrireprerrrpertrr gs = 2S 1@9us/step = Joss: 
O55: 1.6241 - val_mean_absolute_error: 6.6628 - val_acc: 6.7748 


Epoch 14/20 

163270/16320 [seeeeeeesessesssssessssssssss=] = £5 1160us/step - 1055: 
O55: 1.6239 - val_mean_absolute_error: 6.6628 - val_acc: 68.7775 
Epoch 15/28 


16320/16320 [seeseeeeereesereseerezerezezes| - 275 11705/step - loss: 
OSS: 1.6238 - val_mean_absolute_error: ©.6628 - val_acc: ®.7789 
Epoch 16/28 

1632708/163278 [ seweeerreresesersssrrerereestrs | = £5 1e3us/step = 1055: 
OSS: 1.6237 - val_mean_absolute_error: @.6628 - val_acc: 6.7792 
Epoch 17/28 

16326/1632e [seeeeeeseesessssssssesssessss=] = 25 101u5/Sstep - 1055: 
055: 1.6237 - val_mean_absolute_error: 68.6628 - val_acc: 6.7856 
Epoch 18/28 

16320/16320 [seeeseeereeeseecessereeeesessss| - 25 100u5/step - 1055: 
055: 1.6236 - val_mean_absolute_error: 6.6627 - val_acc: 6.7828 
Epoch 15/28 

163276/163278 [ eeweeresrrresersssrrereceertrs | = £5 1e3us/step = 1055: 
OSS: 1.6235 - val_mean_absolute_error: 6.6627 - val_acc: 6.7826 
Epoch 28/28 


O55: 1.6235 - val_mean_absolute_error: ©.6627 - val_acc: 6.7853 


1.4586 


1.4536 


1.4514 


1.4592 


1.4493 


1.4456 


1.4452 


1.4478 


1.4476 


1.4473 


1.4472 


1.4478 


1.4465 


1.4463 


1.4568 


1.4467 


1.4467 


1.4566 


i 


CHAPTER 4 AUTOENCODERS 


mean_absolute_error: 


mean_absolute_error: 


mean_absolute_error: 


mean_absolute_error: 


mean_absolute_error: 


mean_absolute_error: 


mean_absolute_error: 


mean_absolute_error: 


mean_absolute_error: 


mean_absolute_error: 


mean_absolute_error: 


mean_absolute_error: 


mean_absolute_error: 


mean_absolute_error: 


mean_absolute_ error: 


mean_absolute_error: 


méean_absolute_error: 


mean_absolute_error: 


Figure 4-12. Showing the progress of the training phase 


8.6595 


8.6582 


8.6578 


€.65/75 


8.6574 


8.6573 


@.65/2 


8.6572 


8.6572 


@.65/72 


@.65/1 


8.6571 


6.65/71 


@.6571 


8.6571 


8.6578 


®.657¢8 


8.6578 


acc: 


acc: 


acc; 


acc: 


acc: 


acc; 


acc: 


acc: 


acc; 


acc: 


acc? 


acc: 


acc: 


Bcc. 


acc: 


Bec: 


acc. 


acc: 


8.6291 


8.6718 


8.6953 


a. 7140 


8.7300 


8.7474 


8.7580 


8.7678 


O.7722 


8.7769 


@. 7820 


8.7847 


@.7871 


@. 7881 


8.7897 


8.7955 


val_l 


val_l 


val_l 


val_l 


val_l 


val_l 


val_l 


val_l 


val_l 


val_l 


val_l 


val_l 


val_l 


val_l 


val_l 


- val_l 


val_l 


val_l 


133 


CHAPTER 4 AUTOENCODERS 


Figure 4-13 is a graph of the model as shown by TensorBoard. 


Main Graph Auxiliary 
Nodes 


IsVariableinitialized[0-1 1] 


vee 
derse_* 
derse.2 


training 





RMSprop = metrics poup.deps | loss 


dense_2 





dense_1 


Figure 4-13. Model graph shown in TensorBoard 


Figure 4-14 shows the plotting of the accuracy during the training process through 
the epochs of training. 





g 10 12 14 iL: 18 


Figure 4-14. Plotting of accuracy shown in TensorBoard 


134 


CHAPTER 4 AUTOENCODERS 


Figure 4-15 shows the plotting of the mae (mean absolute error) during the training 
process through the epochs of training. 


mean_absolute_error 


0 2? 4 6 8 10 12 14 16 18 


Figure 4-15. Plotting of mae shown in TensorBoard 


Figure 4-16 shows the plotting of the loss during the training process through the 
epochs of training. 


4 6 § 10 12 14 16 


Figure 4-16. Plotting of loss shown in TensorBoard 


135 


CHAPTER 4 AUTOENCODERS 


Figure 4-17 shows the plotting of the accuracy of validation during the training 
process through the epochs of training. 


val_acc 


Figure 4-17. Plotting of validation accuracy shown in TensorBoard 


Figure 4-18 shows the plotting of the loss of validation during the training process 
through the epochs of training. 


val_loss 


5 it] 5 10 15 20 


Figure 4-18. Plotting of validation loss shown in TensorBoard 


Now that the training process is complete, let’s evaluate the model for loss and 
accuracy. Figure 4-19 shows that the accuracy is 0.81, which is pretty good. It also shows 
the code to evaluate the model. 


136 


score = autoencoder.evaluate(x_test, x_test, verbose=-1) 
print('Test loss:", score[@]) 
print('Test accuracy:', score[1]) 


Test loss: 1.3027283556321088 
Test accuracy: @.81544117647e5882 


Figure 4-19. Code to evaluate the model 


CHAPTER 4 AUTOENCODERS 


The next step is to calculate the errors, and detect and also plot the anomalies and 


errors. Choose a threshold of 10. Figure 4-20 shows the code to measure anomalies 


based on that threshold. 


threshold=10.¢e¢0 
y_pred = autoencoder.predict(x_test) 
y_dist = np.linalg.norm(x_test - y_pred, axis=-1) 
Z = zip(y_dist >= threshold, y_dist) 
y_label-[] 
error = [] 
for idx, (is_anomaly, y_dist) in enumerate(z): 
if i5_anomaly: 
y_label.append(1) 
else: 
y_label .append(@) 
error.append(y_dist) 


Figure 4-20. Code to measure anomalies based on a threshold 


Let’s delve deeper into the code shown above because this will be seen throughout 


the chapter when you classify data points as anomalies or normal. As you can see, this 


is based on a special parameter called the threshold. You are simply looking at the error 


(difference between actual and predicted) and comparing it to the threshold. First, 


calculate the precision and recall for threshold = 10. Figure 4-21a shows the code to show 


the precision and recall. 


print(classification_report(y_test,y label) ) 


precision recall #1-score 


a) 1.00 8.97 6.98 

i 2.41 8.86 8.56 

accuracy 0.97 
macro avg @.71 8.92 0.77 
weighted avg 2.98 8.97 8.97 


Figure 4-21la. Code to show the precision and recall 


support 


3987 
93 


42808 
4280 
42080 


137 


CHAPTER 4 AUTOENCODERS 


Let’s also calculate for thresholds = 1, 5, 15. See Figures 4-21b, 4-21c, and 4-21d. 
Threshold = 1.0 


print(classification_report(y_test,y_label)) 


precision recall f1-score support 

@ 2.00 8.02 0.22 3987 

1 @.82 1.62 8.04 93 

accuracy 0.02 4280 
macro avg @.61 8.52 8.22 4289 
weighted avg 2.00 8.02 0.02 4280 


Figure 4-21b. Code to show the precision and recall for threshold = 1.0 


Threshold = 5.0 


print(classification_report(y_test,y_label)) 


precision recall #1-score support 

@ 1.00 @.75 8.86 3987 

i @.98 0.97 8.15 93 

accuracy 8.76 4289 
macro avg @.54 8.86 8.51 4280 
weighted avg 2.98 0.76 0.84 4280 


Figure 4-21c. Code to show the precision and recall for threshold = 5.0 


Threshold = 15.0 


print(classification_report(y_test,y_label)) 


precision recall #1-score support 

@ 8.99 8.99 8.99 3987 

1 @.57 @.66 80.61 93 

accuracy 8.98 4289 
macro avg @.78 @.82 8.82 4289 
weighted avg 2.98 @.98 8.98 4880 


Figure 4-21d. Code to show the precision and recall for threshold = 15.0 


If you observe the four classification reports, you can see that the precision and recall 
columns are not good (note the very low values for precision and recall in row 0 and row 1) 
for threshold = 1 or 5. They look better for threshold = 10 or 15. In fact, threshold = 10 
looks pretty good with a good recall and also higher precision than for threshold = 1 or 5. 


138 


CHAPTER 4 AUTOENCODERS 


Picking a threshold is a matter of experimentation in this and other models and 
changes as per the data being trained on. 

Compute the AUC (Area Under the Curve, 0.0 to 1.0) which comes up as 0.86. 
Figure 4-21e shows the code to show AUC. 


roc_auc_score(y_test, y_label) 
6. 8650574043059298 


Figure 4-2le. Code to show AUC 


You can now visualize the confusion matrix to see how well you did with the model. 
Figure 4-22 shows the confusion matrix. 


viz = Visualization( ) 
viz.draw_confusion_matrix(y_test, y_label) 


Confusion Matri 





Predicted 


Figure 4-22. Confusion matrix 


Now, using the predictions of the labels (normal or anomaly), you can plot the 
anomalies in comparison to the normal data points. Figure 4-23 shows the anomalies 
based on the threshold. 


139 


CHAPTER 4 AUTOENCODERS 


viz.draw_anomaly(y_test, error, threshold) 











Anomalies 
ur s @® Moral 
| * A Arnierialy 
— Threshold 
Ez 
= 
a 
My 
r. 
x = 
a 4 e 
* 
a 
=e ® 
| 
‘| ry a . 











Dats 


Figure 4-23. Anomalies based on the threshold 


Sparse Autoencoders 


In the above example of a simple autoencoder, the representations were only 
constrained by the size of the hidden layer (12). In such a situation, what typically 
happens is that the hidden layer is learning an approximation of PCA (principal 
component analysis). But another way to constrain the representations to be compact 
is to add a sparsity constraint on the activity of the hidden representations, so fewer 
units would fire at a given time. In Keras, this can be done by adding an activity __ 
regularizer to your dense layer. 

The difference between the simple and sparse autoencoders is mostly due to the 


regularization term being added to the loss during training. 
from keras import regularizers 


You will use the same credit card dataset as in the simple autoencoder example 
above. You will use the credit card data to detect whether a transaction is normal/ 
expected or abnormal/anomaly. Shown below is the data being loaded into a Pandas 


dataframe. 


140 


CHAPTER 4 AUTOENCODERS 


Then, you will collect 20k normal and 400 abnormal records. You can pick different 
ratios to try, but in general more normal data examples are better because you want 
to teach your autoencoder what normal data looks like. Too much abnormal data in 
training will train the autoencoder to learn that the anomalies are actually normal, which 
goes against your goal. Split the dataframe into training and testing data sets (80-20 split). 
Now it’s time to create a neural network model with just an encoder and 
decoder layer. You will encode the 29 columns of the input credit card dataset into 
12 features using the encoder. The decoder will expand the 12 back into 29 features. 
The key difference compared to the simple autoencoder is the activity regularizer to 
accommodate the sparse autoencoder. Figure 4-24 shows the code to create the neural 
network. 


encoding dim = 12 
input_dim = x_train.shape[i) 


inputArray = Input(shape=(input_dim, )) 
encoded = Dense(encoding dim, activation='relu', 
activity_regularizer=regularizers.11(1e¢e-5))(inputarray) 


decoded = Dense(input_dim, activation-'softmax')(encoded) 


auteencoder = Model(inputarray, decoded) 
autoencoder. summary( ) 


WARNING: tensorflow:From C:\Programpata\Anaconda3\lib\site-packages\tensorflow\python\framework\op_def_library.py:263: colocate_ 
with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version. 

Instructions for updating: 

colocations handled automatically by placer. 


Layer (type) =———“‘<‘é;:é tpt Shape—ttitéPar 
iL. CC ck SS 
dense_i (Dense) (None, 12) 360 
dense_2 (Dense) (None, 25) 377 


Total params: 737 
Trainable params: 737 
Non-trainable params: 6 


Figure 4-24. Code to create the neural network 


141 


CHAPTER 4 AUTOENCODERS 


Figure 4-25 shows the graph of the model as visualized by TensorBoard. 


Main Graph Auxiliary Nodes 
(Gre: = - Ga: = = Se 
dense.2 " dense_4 ~ ve rites — 
os weiee tcl 
7] Jonse wae fo on. = 
= = ah — 
janelle a 
Dene. or 
|_ RMSprop “ 
RMSpropt }.- “vw 


Figure 4-25. Model graph created by TensorBoard 


Deep Autoencoders 


You do not have to limit yourself to a single layer as encoder or decoder; you can use a 
stack of layers. It’s not a good idea to use too many hidden layers, and how many layers 
depends on the use case, so you have to play with it to seek the optimal number of layers 
and the compressions. 

The only thing that really changes is the number of layers. Shown below is the simple 
autoencoder with multiple layers. 

You will use the example of credit card data to detect whether a transaction is 
normal/expected or abnormal/anomaly. Shown below is the data being loaded into 
Pandas dataframe. 

You will collect 20k normal and 400 abnormal records. You can pick different ratios 
to try, but in general more normal data examples are better because you want to teach 
your autoencoder what normal data looks like. Too much abnormal data in training 
will train the autoencoder to learn that the anomalies are actually normal, which goes 
against your goal. Split the dataframe into training and testing data sets (80-20 split). 

Now it’s time to create a deep neural network model with three layers for the encoder 
layer and three layers as part of decoder layer. You will encode the 29 columns of the 
input credit card dataset into 16, then 8, and then 4 features using the encoder. The 
decoder expands the 4 back into the 8 and then 16 and then finally into 29 features. 


Figure 4-26 shows the code to create the neural network. 


142 


#deep autoencoder 
logfilename = "deepautoencoder” 


encoding_dim = 16 
input_dim = x_train.shape[1] 


inputArray = Input(shape=(input_dim, )) 

encoded = Dense(encoding_dim, activation='relu')(inputarray) 
encoded = Dense(8, activation='relu')(encoded) 

encoded = Dense(4, activation=-'relu')(encoded) 


decoded = Dense(&, activation='relu‘)(encoded) 
decoded = Dense(encoding_dim, activation="relu')(decoded) 
decoded = Dense(input_dim, activation='softmax' )(decoded) 


autoencoder = Model(inputaérray, decoded) 


autoencoder .. summary () 

Layer (type) Output Shape Param # 
input_7 (Inputvayer) (None, 29) 
dense_is (Dense) (itsti(‘é‘MOAzw GYD CS 
dense 16 (Dense) (None, 8) 136 
dense_17 (Dense) (None, 4) 36 
dense_is (Dense) ~ (None, 8)” 40 
dense_19 (Dense) (None, 16) 144 ‘ 
dense_2@ (Dense) (None, 29) 493 


SSS SSS SSS SS SSS SS SSS SSS SS SSS SS SS SSS SS SSS SSS SS SSS SS SSS SS eS ee ee 
Total params: 1,329 

Trainable params: 1,329 

Non-trainable params: 6 


Figure 4-26. Code to create the neural network 


CHAPTER 4 AUTOENCODERS 


Figure 4-27 shows the graph of the model as visualized by TensorBoard. 





° m=, 
ae. 4 
ee 


P 

J j 
J / J 
2 PF : gen Pl 

Sah dda serdall. “4 Serda_ 4. Saree 4, . ~~" . ~l 
. ee " eens | Sense? . ouws , — - — _ senses | : cy [dense | . ag 
~ wong \ : ~ weno ~ womegs 
. 


Figure 4-27. Model graph shown in TensorBoard 





ot 
outta MA ceGD 5 
“rw - § > 
re 
co oe 
Deel + 
wee 
eS 
euuntte. : —- 
r oe ~ 
ee) waining=2 ie 
ee eee 
ao 
wae ee 
ell yom ‘ , 
—- . . 
a od “) 
pasage at taining_1 wee 
— 2 —_- 
rwml 
was 
ete wm 
_—. cus % 
ms - . - oad 
qna3m . ate 9 . weer 
iw 
- 
“ree 
rs 
-* 
ree 
rors" 
~~ 
“wee 
vres+g3 





143 


CHAPTER 4 AUTOENCODERS 


Convolutional Autoencoders 


Whenever your inputs are images, it makes sense to use convolutional neural networks 
(convnets or CNNs) as encoders and decoders. In practical settings, autoencoders 
applied to images are always convolutional autoencoders because they simply perform 
much better. 

Let’s implement one. The encoder will consist in a stack of Conv2D and MaxPooling2D 
layers (max pooling is being used for spatial down-sampling), while the decoder will 
consist in a stack of Conv2D and UpSamp1ling2D layers. 

Figure 4-28 shows the basic code to import all necessary packages in a Jupyter 
notebook. Also note the versions of the various packages. 


import keras 

from keras import optimizers 

from keras import losses 

from keras.models import Sequential, Model 

from keras.layers import Dense, Input, Dropout, Embedding, LSTM 
from keras.optimizers import RMSprop, Adam, Nadam 

from keras.preprocessing import sequence 

from keras.callbacks import Tensorsoard 

from keras import regularizers 


import sklearn 

from sklearn.preprocessing import StandardsScaler 

from sklearn.model_selection import train_test_split 

from sklearn.metrics import confusion_matrix, roc_auc_score 
from sklearn.preprocessing import MinMaxScaler 


import seaborn as sns 
import pandas as pd 
import numpy as np 
import matplotlib 


import matplotlib.pyplot as plt 
import matplotlib.gridspec as gridspec 
Xmatplotlib inline 


import tensorflow 
import sys 
print("Python: ", sys.version) 


print("pandas: “, pd.__version_) 
print("numpy: ", np.__version_) 
print("seaborn: ", sns.__version_) 
print("matplotlib: ", matplotlib.__version_) 
print(“sklearn: ", sklearn.__version_) 
print("Keras: “, keras.__version_ ) 
print("“Tensorflow: “, tensorflow.__version_) 


Using TensorFlow backend. 


Python: 3.7.1 (default, Dec 10 2018, 22:54:23) [MSC v.1915 64 bit (AD64)] 
pandas: 9.24.2 
numpy: 1.16.3 
seaborn: 98.9.9 
matplotlib: 3. 
sklearn: 96.2@. 
Keras: 2.2.4 

Tensorflow: 1.13.1 


6.3 
3 


Figure 4-28. Importing packages in a Juypter notebook 
144 


CHAPTER 4 AUTOENCODERS 


You will use the mnist images data set for this purpose. Mnist contains images for the 
digits 0 to 9 and is used for many different use cases. Figure 4-29 shows the code to load 
MNIST data. 


from keras.datasets import mnist 
import numpy as np 


(x_train, _), (x_test, _) = mnist.load_data() 


Figure 4-29. Code to load MNIST data 


Split the dataset into training and testing subsets. You must also reshape the data to 
28X28 images. Figure 4-30 shows the code to transform the images from MNIST. 


from keras.datasets import mnist 
import numpy as np 


(x_train, _), (xtest, _) = mnist.load_data() 


x_train = x_train.astype(‘float32") / 255. 

x_test = x_test.astype(‘float32") / 255. 

x_train = np.reshape(x_train, (len(x_train), 28, 28, 1)) # adapt this if using channels first image data formot 
x_test = np.reshape(x_test, (len(x_test), 28, 28, 1)) # adapt this if using ‘channels first’ image data formot 


Figure 4-30. Code to transform the images from MNIST 


Create a CNN model with Convolutions and MaxPool layers. Figure 4-31 shows the 


code to create the neural network. 


145 


CHAPTER 4 AUTOENCODERS 


from keras.layers import Input, Dense, Conv2D, MaxPooling2D, UpSampling2D 
from keras.models import Model 
from keras import backend as K 


#cnn autoencoder 
logfilename = “cnnautoencoder2” 


input_img = Input(shape=(28, 28, 1)) # adapt this if using “channels first’ image data format 


x = Conv20(16, (3, 3), activation='relu’, padding='same*)(input_img) 
X = MaxPooling2D((2, 2), padding="same*)(x) 

x = Conv20(8, (3, 3), activation="relu", padding='same')(x) 

x = MaxPooling20((2, 2), padding="same*)(x) 

x = Conv20(8, (3, 3), activation='relu’, padding=‘same')(x) 

encoded = MaxPooling2D((2, 2), padding="same‘)(x) 


# at this point the representation is (4, 4, 8) t.e. 128-dimensional 


xX = Conv20(8, (3, 3), activation='relu', padding='bame') (encoded) 

X = UpSampling20((2, 2))(x) 

x = Conv20(8, (3, 3), activation="relu’, padding=‘same')(x) 

X = UpSampling20((2, 2))(x) 

x = Conv20(16, (3, 3), activation="relu")(x) 

xX = UpSampling20((2, 2))(x) 

decoded = Conv20(1, (3, 3), activation="sigmoid’, padding='same')(x) 


autoencoder = Model(input_img, decoded) 


autoencoder. summary () 


WARNING: tensorflow:From C:\ProgramData\Anaconda3\lib\site-packages\tensorflow\python\framework\op_def_library.py:263: colocate_ 
with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version. 

Instructions for updating: 

Colocations handled automatically by placer. 


Layer (type) == Output shape—(“‘éeéCéPOr OM 
input_1 (Inputlayer) (None, 28, 28,1) = 
conv2d_1 (Conv2D) (None, 28, 28, 16) 162 
max_pooling2d_1 (MaxPooling2 (None, 14, 14, 16) ® 
conv2d_2 (Conv20) (None, 14, 14, 8) 116¢ 
max_pooling2d_2 (MaxPooling2 (None, 7, 7, 8) 8 
conv2d_3 (Conv20) (None, 7, 7, 8) ses 
max_pooling2d_3 (MaxPooling2 (None, 4, 4, 8) e 
conv2d_4 (Conv2D) (None, 4, 4, 8) S84 
up_sampling2d_1 (UpSampling2 (None, 8, 8, 8) & 
conv2d_S (Conv20) (None, 8, 8, 8) S84 
up_sampling2d_2 (UpSampling2 (None, 16, 16, 8) @ 
conv2d_6 (Conv2D) (None, 14, 14, 16) 1168 
up_sampling2d 3 (UpSampling2 (None, 28, 28, 16) 8 
conv2d_7 (Conv20) (None, 28, 28, 1) 145 


Total params: 4,385 
Trainable params: 4,385 
Non-trainable params: @ 


Figure 4-31. Code to create the neural network 


Compile the model using RMSprop as the optimizer and mean squared error for the 
loss computation. The RMSprop optimizer is similar to the gradient descent algorithm 
with momentum. Figure 4-32 shows the code to compile the model. 


146 


CHAPTER 4 AUTOENCODERS 


putoencoder.compile(optimizer=RMSprop(), 
loss="mean_ squared error’, 
metrics=[‘mae', “accuracy']) 


Figure 4-32. Code to compile the model 


Now you can start training the model using the training dataset while using the 
validation dataset to validate the model at every step. Choose 32 as the batchsize and 20 
epochs. The training process outputs the loss and accuracy as well as the validation loss 
and validation accuracy at each epoch. Figure 4-33 shows the model being trained. 


batch_size = 32 
epochs = 20 


history = autoencoder.fit(x_train, x_train, 
batch_size=batch_size, 
epochs=epochs, 
verbose=1, 
shuffle-True, 
validation_data=(x_test, x_test), 
callbacks=[Tensorboard(log dir="../logs/{e}*.format(logfilename))]) 





26 

Epoch 15/20 

60000/60000 [eaaccenccescecsscesacececeezes] - 125 194u5/step - 1055: 8.0116 - acc: @.8129 - val_loss: 8.6105 - val_acc: @.81 
24 

Epoch 16/20 

60000/66000 [exsessecesenensezessesezeese=s] - 125 196u5/step - 1055: 6.6114 - acc: 6.8136 - val_loss: 6.6117 - val_acc: 6.81 
eg 

Epoch 17/20 

60000/60000 [aaaeseeceseceesecsssssescesess] - 115 183u5/step - 1055: 6.0113 - acc: @.8131 - val_loss: 6.6108 - val_acc: 6.81 
36 

Epoch 18/20 

60000/60000 [aaeasanseuseccusecessesezeesess] - 115 188us/step - loss: 6.0112 - acc: @.8131 - val_loss: 8.6103 - val_acc: @.81 
27 

Epoch 19/28 

60000/60000 [==2s2e2eeeseeezeseseese=ss====] - 115 190u5/step - 1055: 6.0110 - acc: 6.8132 - val_loss: 6.0107 - val_acc: 6.81 


Epoch 20/20 
60000/60000 [s=sessessssesesescessssssesse=]} - 125 152u5/step - loss: 6.0169 - acc: 6.8132 - val_loss: 98.6103 - val_acc: 0.81 


Figure 4-33. The model being trained 


Now that the training process is complete, let’s evaluate the model for loss and 
accuracy. Figure 4-34 shows that the accuracy is 0.81, which is pretty good. It also shows 
the code to evaluate the model. 


score = autoencoder.evaluate(x_test, x_test, verbose=1) 
print("Test loss:", score[6]) 
print('Test accuracy:", score[1]) 


16000/10000 [===x=e==ee=enzeszecz=eszzerrez=z| - 15 68u5/step 
Test loss: 9.010284392775595189 
Test accuracy: 6.8126302285194397 


Figure 4-34. Code to evaluate the model 


147 


CHAPTER 4 AUTOENCODERS 


The next step is to use the model to generate the output images for the testing subset. 


This will show how well the reconstruction phase is going. Figure 4-35 shows the code to 


predict based on the model. 


decoded_imes = autoencoder .predict(x_test) 


i = 
olt. 
for 


plt. 





2RO CI MMSE, 


18 

figure(figsize-(28, 4)) 

i in range(i, n): 

# disploy original 

ax = plt.subplot(2, n, i) 
plt.imshow(x_test[i].reshape(2s, 28)) 
plt.gray() 
ax.get_xaxis().set_visible(False) 
ax.get_yaxis().set_visible(False) 


# display reconstruction 

ax = plt.subplot(2, m, 1 + m) 
plt.imshow(decoded_imes|1].reshape(28, 28)) 
pit.gray() 

ax.get xaxis().set_visible(False) 
ax.get_yaxis().set_visible(False) 

Show() 


ATIC CIMA GANG E2 






Figure 4-35. Code to predict based on the model 


You can also see how the encoder phase is working by displaying the test subset 


images in this phase. Figure 4-36 shows the code to display the encoded images. 


148 


CHAPTER 4 AUTOENCODERS 


encoder = Model(input_img, encoded) 
encoded_imgs = encoder.predict(x_test) 
n= 10 
plt.figure(figsize=-(2e, &)) 
for i in range(i, n): 
ax = plt.subplot(i, n, i) 
plt.imshow(encoded_imgs[i].reshape(4, 4 * 8).T) 
plt.gray() 
ax.get_xaxis().set_visible(False) 
ax.get_yaxis().set_visible(False) 
plt.show() 





Figure 4-36. Code to display encoded images 


Figure 4-37 shows the graph of the model as visualized by TensorBoard. 


149 


CHAPTER 4 AUTOENCODERS 


Main Graph Auxiliary Nodes 





) ooh fades mnt | ieee 
[ loss ——- \ metrics Senin seine nae 








a" 


. eae 


training 


- ee a 
wary ~— 
ee tetas ODD! 





Pe told 
, aon onmblelnitialized|O-31] 

— peed? * ——T5 

Pe a 

pert: 

toe 
we 
—e f a ha =e 
RMSprop + 


ee ee 


ee 
Bt] 
eae 


= 
ewes 
cave a 





rare 


a | 
wars 
Gb ab ee 


re 
CS rd 
ae 





CC ering 


Figure 4-37. A model graph shown in TensorBoard 


Figure 4-38 shows the plotting of the accuracy during the training process through 
the epochs of training. 


150 


CHAPTER 4 AUTOENCODERS 


Smoothed Value Step Time Relative 


| ‘@) cnnauicencoder? 0.8119 0.8123 8 Wed Jun 12, 22:24:52 1m 336 





Figure 4-38. Plotting of accuracy shown in TensorBoard 


Figure 4-39 shows the plotting of the loss during the training process through the 
epochs of training. 


loss 


0 1 Z 3 4 5 5 fi a 9 10 


Figure 4-39. Plotting of loss shown in TensorBoard 


151 


CHAPTER 4 AUTOENCODERS 


Figure 4-40 shows the plotting of the accuracy of validation during the training 


process through the epochs of training. 


val_ace 


Figure 4-40. Plotting of validation accuracy shown in TensorBoard 


Figure 4-41 shows the plotting of the loss of validation during the training process 


through the epochs of training. 


Hame Smoothed Value Step Time Relative 
@ cnnautoencoder? 0.01759 0.01661 3 WedJun12,22-23:54 35s 





Figure 4-41. Plotting of validation loss shown in TensorBoard 


152 


CHAPTER 4 AUTOENCODERS 


Denoising Autoencoders 


You can force the autoencoder to learn useful features by adding random noise to its 
inputs and making it recover the original noise-free data. This way the autoencoder 
can’t simply copy the input to its output because the input also contains random noise. 
The autoencoder will remove noise and produce the underlying meaningful data. 

This is called a denoising autoencoder. Figure 4-42 shows a depiction of a denoising 


autoencoder. 


Denoising Autoencoder 


Encoder Decoder 











Figure 4-42. Depiction of a denoising autoencoder 


Other example is a security monitoring camera capturing some kind of hazy unclear 
picture, maybe in the dark or during adverse weather, causing a noisy image. 

The logic behind the denoising autoencoder is that if you have trained your encoder 
on good normal images, and the noise, when it comes as part of the input, is not really a 
salient characteristic, it is possible to detect and remove such noise. 

Figure 4-43 shows the basic code to import all necessary packages. Also note the 


versions of the various packages. 


153 


CHAPTER 4 AUTOENCODERS 


import keras 

from keras import optimizers 

from keras import losses 

from keras.models import Sequential, Model 

from keras.layers import Dense, Input, Dropout, Embedding, LSTM 
from keras.optimizers import RMSprop, Adam, Nadam 

from keras.preprocessing import sequence 

from keras.callbacks import Tensoréoard 

from keras import regularizers 


import sklearn 

from sklearn.preprocessing import Standardscaler 

from sklearn.model_selection import train_test_split 

from sklearn.metrics import confusion_matrix, roc_auc_score 
from sklearn.preprocessing import MinMaxScaler 


import seaborn as 5ns 

eet pandas as pd 
import numpy as np 

import matplotlib 


import matplotlib.pyplot as plt 
import matplotlib.gridspec as gridspec 
Amatplotlib inline 


import tensorflow 
import sys 
print("Python: ", sys.version) 


print("pandas: ", pd.__version_) 
print(“numpy: “, np._ version_) 
print("seaborn: ", sns. __version_) 

print (“satplotlib: ", matplotlib. version_) 
print("sklearn: ", sklearn.__version_) 
print("Keras: ", keras.__version_) 
print("Tensorflow: ", tensorflow._version_) 


Using TensorFlow backend. 


python: 3.7.1 (default, Dec 1@ 2018, 22:54:23) [MSC v.1915 64 bit (AND64)] 
pandas: @.24,2 


numpy: 1.16.3 
seaborn: 6.9.0 


matplotlib: 3.6.3 
sklearn: 6.20.3 
Keras: 2.2.4 
Tensorflow: 1.13.1 


Figure 4-43. Code to import packages 


You will use the mnist images data set for this purpose. Mnist contains images for the 
digits 0 to 9 and is used for many different use cases. Figure 4-44 shows the code to load 
MNIST images. 


from keras.datasets import mnist 


import numpy as np 
(x_train, _), (x_test, _) = mnist.load_data() 


Figure 4-44, Code to load MNIST images 


154 


CHAPTER 4 AUTOENCODERS 


Split the dataset into training and testing subsets. Also, reshape the data to 28X28 
images. Figure 4-45 shows the code to load and reshape images. 


from keras.datasets import mnist 
import numpy as np 


(x_train, _), (x_test, y_test) = mnist.load_data() 


x_train = x_train.astype(‘*float32') / 255. 

x_test = x_test.astype('float32') / 255. 

x_train = np.reshape(x_train, (len(x_train), 28, 28, 1)) # adapt this if using ‘channels_first’ image data format 
x_test = np.reshape(x_test, (len(x_test), 28, 28, 1)) # adopt this if using channels first image dato format 


noise factor = 0.3 
x_train_noisy = x_train + noise_factor * np.random.normal(loc=0.0, scale=1.0, size=x_train. shape) 
x_test_noisy = x_test + noise_factor * np.random.normal(loc=8.9, scale=1.0, size=x_test.shape) 


x_train_noisy = np.clip(x_train_noisy, @., 1.) 
x_test_noisy = np.clip(x_test_noisy, @., 1.) 


print (x_train_noisy. shape) 
print(x_test_noisy. shape) 
print(y_test. shape) 


(69000, 28, 28, 1) 
(10800, 28, 28, 1) 
(16e2@, ) 


Figure 4-45. Code to load and reshape images 


Figure 4-46 shows the code to display the images. 


n= ii 

plt.figure(figsize=(20, 2)) 

for i in range(i1, n): 
ax = plt.subplot(i1, n, i) 
plt.imshow(x_test_noisy[i].reshape(28, 28)) 
plt.gray() 
ax.get_xaxis().set_visible(False) 
ax.get_yaxis().set_visible(False) 

plt.show() 





Figure 4-46. Code to display the images 


Create a CNN model with Convolutions and MaxPool layers. Figure 4-47 shows the 


code to create the neural network. 


155 


CHAPTER 4 AUTOENCODERS 


from keras.layers import Input, Dense, Conv2D, MaxPooling2D, UpSampling2D 
from keras.models import Model 
from keras import backend as K 


#cnn outoencoder 
logfilename = “DenoisingAutoencoder2" 


input_ime = Input(shape=(28, 28, 1)) # adapt this if using channels_first image dota formot 


= ConvzD(16, (3, 3), activation="relu’, padding="same")(input_img) 
MaxPooling2D((2, 2), padding="same")(x) 

conv2D(8, (3, 3), activation="relu’, padding='same")(x) 
MaxPooling2D((2, 2), padding="same')(x) 

conv20(8, (3, 3), activation="relu', padding='same')(x) 

encoded = MaxPooling2D((2, 2), padding="same*)(x) 


x 
x 
x 
x 
x 


# at this point the representation is (4, 4, 8) i.e. 128-dimensional 


x = Conv20(8, (3, 3), activation-'relu', padding-'same') (encoded) 

x = UpSampling2D((2, 2))(x) 

x = Conv2D(8, (3, 3), activation="relu’, padding="same')(x) 

xX = UpSampling2D((2, 2))(x) 

x = Conv2D(16, (3, 3), activation='relu')(x) 

xX = UpSampling2D((2, 2))(x) 

decoded = Conv2D(1, (3, 3), activation="sigmoid', padding="same*)(x) 


autoencoder = Model(input_img, decoded) 


autoencoder.. summary() 

Layer (type)  —«- Output Shape (iti(‘iéPzrm 
input_2 (Inputuayer) (None, 28, 28,3) = @ 
conv2d_8 (Conv2D) (None, 28, 28, 16) 168 
max_pooling2d_4 (MaxPooling2 (None, 14, 14, 16) @ 

conv2d_9 (Conv2D) (None, 14, 14, 8) 1168 
max_pooling2d_5 (MaxPooling2 (None, 7, 7, %) 8 
conv2d_1@ (Conv2D) (None, 7, 7, 8) 584 
max_pooling2d_6 (MaxPooling2 (None, 4, 4, 3) i] 
conv2d_11 (Conv2D) (None, 4, 4, 2) 534 
up_sampling2d_4 (UpSampling2 (None, 8, 8, 8) a 
conv2d_12 (Conv2D) (None, 8, 8, &) 584 
up_sampling2d_5 (UpSampling2 (None, 16, 16, 8) Ls] 
conv2d_13 (Conv2D) (None, 14, 14, 16) 1165 
up_sampling2d_6 (UpSampling2 (None, 28, 28, 16) a 
conv2d_14 (Conv2D) (None, 28, 28, 1) 145 


Total params: 4,385 
Trainable params: 4,385 
Non-trainable params: 6 


Figure 4-47. Code to create the neural network 


Compile the model using RMSprop as the optimizer and mean squared error for the 
loss computation. The RMSprop optimizer is similar to the gradient descent algorithm 
with momentum. Figure 4-48 shows the code to compile the model. 


156 


autoencoder.compile(optimizer=RMSprop(), 


loss=‘mean_squared_error’, 
metrics=['mae', ‘accuracy"]) 


Figure 4-48. Code to compile the model 


CHAPTER 4 AUTOENCODERS 


Now, you can start training the model using the training dataset to validate the model 


at every step. Choose 32 as the batchsize and 20 epochs. The training process outputs 


the loss and accuracy as well as the validation loss and validation accuracy at each epoch. 


Figure 4-49 shows the code to start training the model. 


batch_size = 32 
epochs = 20 


history = autoencoder.fit(x_train_noisy, x_train, 
batch_size-batch_size, 
epochs=epochs, 
verbose=1, 
shuffle-True, 
validation_data=(x_test_noisy, x_test), 


callbacks=[TensorBoard(log dir="../logs/{@}'.format(logfilename) )}) 


Train on 6eeee samples, validate on 1000@ samples 
Epoch 1/2¢e 


600202 / 60208 [sesesscsesssssssssssssssssssss } - 135 212us/step - 
Epoch 2/2¢é 

6020 | 60203 [ssseseccussesessesesesessssszs } - 125 200us/step - 
Epoch 3/2¢e 

6eeee / GEeeO [seeeeeeeeesseseeeseeseeeeee2=22] - 125 198us/step - 
Epoch 4/2¢e 

6OG22 / 60203 [seseseeeeeseeeseeesssseeeze=22]} - 125 193us/step - 
Epoch 5/2 

60022 / GA2e3 [ sseeeseseeseseseesssessssese=s } - 125 193us/step - 
Epoch 6/2 

60G22 ' Gee { seseueeeeesecesecesesesseeesss } - 125 192us/step - 
Epoch 7/28 

SOG22 / GE2e0 [ seeeeseeeseeeesecesseresereezs } - 11s 191us/step - 
Epoch 8/28 

6022 / 6e2e3 [ saseusceussecessesssssesesssss } - 125 193us/step - 
Epoch 9/28 

60000/60000 [a=axeeeeeeeeeeeesesssesesezes=s] - 115 189us/step - 
Epoch 10/20 

60000/60000 [==eseeeeeeeeeeeeseseeeeeeees==] - 115 190usS/step - 
Epoch 11/20 

60022 / 6e2e0 G@ereeerrerrirerrrrrrrrrrrirrrry| - 125 203us/step - 
Epoch 12/20 

600202 / A203 [sseeesceessssessesssssesesse==] - 125 192us/step - 
Epoch 13/20 

60022 ' 6e2ea [ sseeeeeeeeeeeeseeesserssesess: } - 125 192us/step - 
Epoch 14/20 

6OR22 / Gazea [ sesucecccssscesessssesssssssss } - 125 192us/step - 
Epoch 15/20 

6022 / 6E2e0 [ sesesccussssssssssssssesssssss } - 115 19@us/step - 
Epoch 16/20 

SOG22 | 6022 [sseuesecesescesesesessssesssss } - lis 191us/step - 
Epoch 17/20 

60000/60000 [a=eeeneeceseeeeseseseseeesezezs] - 115 190uS/step - 
Epoch 18/20 

600202 / 6e2e0 [seeeeseeeeeeeseseeeeseeeee2e22] - 115 191lus/step - 
Epoch 19/20 

600202 / 60203 [s2seeeeeeeseeseeeesseseeeees==] - 125 192us/step - 
Epoch 20/20 

BGR ' GA2E3 [ saeeeseeeszeeessesesessssezezs | = 115 19@us/step - 


loss: 
loss: 
loss: 
loss: 
loss: 
loss: 
loss: 
loss: 
loss: 
loss: 
loss: 
loss: 
loss: 
loss: 
loss: 
loss: 
loss: 
loss: 
loss: 


loss: 


@.0378 


@.9229 


&.9208 


@.0185 


@.0176 


@.0178 


©.0164 


@.0168 


@.0157 


@.0153 


@.0151 


@.0149 


@.0147 


@.0145 


&.0144 


@.0143 


@.0141 


&.91428 


@.0139 


@.0138 


Figure 4-49. Code to start training the model 


acc: 


acc: 


acc: 


acc: 


acc: 


acc: 


acc: 


acc: 


acc: 


acc: 


acc: 


acc: 


acc: 


acc: 


acc: 


acc: 


acc: 


acc: 


acc: 


acc: 


@.8872 


@.8108 


@.8125 


@.81288 


@.8111 


@.8113 


@.8115 


@.8117 


@.8118 


@.8119 


@.8121 


@.8121 


@.8122 


@.8122 


@.8123 


@.8123 


@.8124 


val_loss: 
val_loss: 
val_loss: 
val_loss: 
val_loss: 
val_loss: 
val_loss: 
val_loss: 
val_loss: 
val_loss: 
val_loss: 
val_loss: 
val_loss: 
val_loss: 
val_loss: 
val_loss: 
val_loss: 
val_loss: 
val_loss: 


val_loss: 


-0257 


-0197 


- 8190 


- 8176 


- 8168 


- 0163 


-0159 


- 8156 


-0155 


+0147 


+8142 


. e144 


- 8166 


. 8149 


-0134 


-0133 


- 0146 


-8153 


+8135 


- 6127 


val_acc: 
val_acc: 
val_acc: 
val_acc: 
val_acc: 
val_acc: 
val_acc: 
val_acc: 
val_acc: 
val_acc: 
val_acc: 
val_acc: 
val_acc: 
val_acc: 
val_acc: 
val_acc: 
val_acc: 
val_acc: 
val_acc: 


val_acc: 


- 8107 


-8111 


-8114 


-8110 


- 8189 


-81e8 


-8124 


-8123 


-81e9 


-8116 


-8101 


- 8107 


-8117 


157 


CHAPTER 4 AUTOENCODERS 


Now that the training process is complete, let’s evaluate the model for loss and 
accuracy. Figure 4-50 shows that the accuracy is 0.81, which is pretty good. It also shows 


the code to evaluate the model. 


score = autoencoder.evaluate(x_test, x_test, verbose=1) 
print('Test loss:", score[@}) 
print('Test accuracy:", score[1}) 


19000 /1e0eee [ senezeeeseeeesseeeseesszeees=ss | - 15 68us/step 
Test loss: @.010875144922733306 
Test accuracy: @.8120423462867736 


Figure 4-50. Code to evaluate the model 


The next step is to use the model to generate the output images for the testing subset. 
This will show you how well the reconstruction phase is going on. Figure 4-51 shows the 


code to display denoised images. 


decoded_imgs = autoencoder .predict(x_test_noisy) 


n= 16 

plt.figure(figsize-(26, 4)) 

for i in range(i, n): 
# display original 
ax = plt.subplot(2, n, i) 
plt.imshow(x_test_noisy[i] .reshape(2s, 23)) 
plt.gray() 
ax.get_xaxis().set_visible(False) 
ax.get_yaxis().set_visible(False) 


# display reconstruction 

ax = plt.subplot(2, m, 1 + m) 

plt.imshow(decoded_imgs[i].reshape(22, 23)) 

pit.gray() 

ax. get “axis().set_visible(False) 

ax.get_yaxis().set_visible(False) 
plt.show() 





Figure 4-51. Code to display denoised images 


158 


CHAPTER 4 AUTOENCODERS 


You can also see how the encoder phase is working by displaying the test subset 
images in this phase. Figure 4-52 show the code to display encoded images. 


encoder = Model(input_img, encoded) 
encoded_imgs = encoder.predict(x_test_noisy) 
n= 10 
plt.figure(figsize-(2¢e, &)) 
for i in range(1, n): 
ax = plt.subplot(i, n, i) 
plt.imshow(encoded_imgs[i].reshape(4, 4 * 8).T) 
plt.gray() 
ax.get_xaxis().set_visible(False) 
ax.get_yaxis().set_visible(False) 
plt.show() 





Figure 4-52. Code to display encoded images 


Figure 4-53 shows the graph of the model as visualized by TensorBoard. 


159 


CHAPTER 4 AUTOENCODERS 


Main Graph Auxiliary Nodes 


cores com FOF 
covatlt! = ateecial zed[0-63} 
re * ——> 


wa. eS 





ow =" 





_>- ED wunng 


Figure 4-53. Model graph shown in TensorBoard 


Figure 4-54 shows the plotting of the accuracy during the training process through 


the epochs of training. 


160 


CHAPTER 4 AUTOENCODERS 


Smoothed Value Step Time Relative 
i O DenoisingAutoencoder2 0.8121 O81272 15 Wed Jun 12,22:57:53 2m 536 





Figure 4-54, Plotting of accuracy shown in TensorBoard 


Figure 4-55 shows the plotting of the loss during the training process through the 
epochs of training. 


0.028 
0.026 
ae 
0.024 


0.022 














Smoothed Value Step Time Relative 


©) DenoisingAutoencoder? 0.01399 0.01382 19 Wed Jun 12,22:58:39 3m 39s 





Figure 4-55. Plotting of loss shown in TensorBoard 


161 


CHAPTER 4 AUTOENCODERS 


Figure 4-56 shows the plotting of the accuracy of validation during the training 
process through the epochs of training. 


Val_ace 


Smoothed Value Step Time Relative 


‘@) DencisingAutoencoder? 0.8103 Oey 18 Wed Jun 12, 22:58:28 3m 28s 





Figure 4-56. Plotting of validation accuracy shown in TensorBoard 


Figure 4-57 shows the plotting of the loss of validation during the training process 
through the epochs of training. 


Val_loss 


oO 2 4 6 8 10 12 


Figure 4-57. Plotting of validation loss shown in TensorBoard 


162 


CHAPTER 4 AUTOENCODERS 


Variational Autoencoders 


A variational autoencoder is a type of autoencoder with added constraints on the 
encoded representations being learned. More precisely, it is an autoencoder that learns 
a latent variable model for its input data. So instead of letting your neural network 
learn an arbitrary function, you learn the parameters of a probability distribution 
modeling your data. If you sample points from this distribution, you can generate new 
input data samples. This is the reason why variational autoencoders are considered to be 
generative models. 

Essentially, VAEs attempt to make sure that encodings that come from some known 
probability distribution can be decoded to produce reasonable outputs, even if they are 
not encodings of actual images. 

In many real-world use cases, we have a whole bunch of data that we’re looking 
at it (it could be images, it could be audio or text; well, it could be anything) but the 
underlying data that needs to be processed might be lower in dimensions than the 
actual data, so lot of the machine learning models involve some sort of dimensionality 
reduction. One very popular technique is singular value decomposition or principal 
component analysis. Similarly, in the deep learning space, variational autoencoders do 
the task of reducing the dimensions. 

Before we dive into the mechanics of variational autoencoders, let’s just recap 
the normal autoencoders that you saw in this chapter. Autoencoders basically use an 
encoder and decoder layer at a minimum to reduce the input data features into a latent 
representation by the encoder layer. The decoder expands the latent representation 
to generate the output with the goal of training the model well enough to reproduce 
the input as the output. Any discrepancy between the input and output could signify 
some sort of abnormal behavior or deviation from what is normal, otherwise known as 
anomaly detection. In a way, the output gets compressed into a smaller representation 
but has less dimension than the input, and this is what we call the bottleneck. From the 
bottleneck, we try to reconstruct the input. 

Now that you have the basic concept of the normal autoencoders, let’s look at the 
variational autoencoders. In variational autoencoders, instead of mapping the input toa 
fixed vector, we map the input to a distribution so the big difference is that the bottleneck 
vector seen in the normal order in quarters is replaced with the mean vector and a 
standard deviation vector by looking at the distributions and then taking the sampled 
latent vector as the actual bottleneck. Clearly this is very different from the normal 
autoencoder where the input directly yields a latent vector. 


163 


CHAPTER 4 AUTOENCODERS 


First, an encoder network turns the input sample x into two parameters in a latent 
space, which you can call z_ mean and z_log sigma. Then, you randomly sample similar 
points z from the latent normal distribution that is assumed to generate the data, 
viaz = z_ mean + exp(z_log sigma) * epsilon, where epsilon is arandom normal 
tensor. Finally, a decoder network maps these latent space points back to the original 
input data. Figure 4-58 depicts the variational encoder neural network. 


Variational Autoencoder 


Mean 


Encoder Sampling 5 Decoder 





Standard 
Deviation | 


Figure 4-58. The variational encoder neural network 


The parameters of the model are trained via two loss functions: a reconstruction 
loss forcing the decoded samples to match the initial inputs (just like in the previous 
autoencoders), and the KL divergence between the learned latent distribution and the 
prior distribution, acting as a regularization term. You can actually get rid of this latter 
term entirely, although it does help in learning well-formed latent spaces and reducing 
overfitting to the training data. 


164 


CHAPTER 4 AUTOENCODERS 


The distribution that you’re learning from is not too far removed from a normally 
distributed so you going to try to force your latent distribution to be relatively close to 
a mean of zero and a standard deviation of one so before you can train your variational 
autoencoder you must consider that there is a sampling problem that could happen. 
Since you are only taking a sample of the distribution from the mean vector and the 
standard deviation, it is harder to realize backpropagation there. You are sampling it so 
how do you get back during the back propagation step? 

A variational autoencoder is a kind of a mix of neural networks and graphical models 
since the first paper came up on variational autoencoder tried to create a graphical 
model and then turn the graphical model to a neural network. The variational auto 
encoder is based on variational inference. 

Assume that there are two different distributions, p and q, and that you can use KL 
divergence to show dissimilarity between the two distributions, p and q. Thus, a KL 
divergence serves as a measure of the similarity between the two distributions, p and q. 

The best way to understand the need for a variational autoencoder is that in a general 
autoencoder, the bottleneck is too dependent on the inputs and there is no understanding 
of the nature of the data. Since you use sampling of the distribution instead, you will be 
able to better accommodate the model to new types of data. 

Figure 4-59 shows the basic code to import all necessary packages in Jupyter. Also 
note the versions of the various necessary packages. 


165 


CHAPTER 4 AUTOENCODERS 


import keras 

from keras import optimizers 

from keras import losses 

from keras import backend as kK 

from keras.models import Sequential, Model 

from keras.layers import Lambda, Dense, Input, Dropout, Embedding, LSTM 
from keras.optimizers import RMSprop, Adam, Nadam 

from keras.preprocessing import sequence 

from keras.callbacks import Tensorsoard 

from keras.losses import mse, binary_crossentropy 


import sklearn 
from sklearn.preprocessing import StandardScaler 


from sklearn.model_selection import train_test_split 
from sklearn.metrics import confusion_matrix, roc_auc_score 
from sklearn.preprocessing import MinMaxScaler 


import seaborn as sns 
import pandas as pd 
import numpy as np 
import matplotlib 


import matplotlib.pyplot as plt 
import matplotlib.gridspec as gridspec 
xmatplotlib inline 


import tensorflow 
import sys 
print("Python: ", sys.version) 


print("pandas: ", pd.__version_) 
print("numpy: ", np.__version_) 
print("seaborn: ", sns.__version_) 
print("matplotlib: ", matplotlib._version_) 
print("sklearn: ", sklearn.__version_) 
print("Keras: ", keras.__version_) 
print("Tensorflow: “, tensorflow.__version_) 


Python: 3.7.1 (default, Dec 1@ 2018, 22:54:23) [MSC v.1915 64 bit (A'D64)] 
pandas: @.24.2 
numpy: 1.16.3 

seaborn: 96.9.0 
matplotlib: 3. 
sklearn: 06.20. 
Keras: 2.2.4 

Tensorflow: 1.13.1 


8.3 
3 


Figure 4-59. Code to import packages in Jupyter 


Figure 4-60 shows the code to visualize the results via a confusion matrix, a chart for 
the anomalies, and a chart for the errors (difference between predicted and truth) while 


training. 


166 


CHAPTER 4 AUTOENCODERS 


class Visualization: 
labels = ["Normal", “Anomaly"] 


def draw_confusion_matrix(self, y, ypred): 
matrix = confusion_matrix(y, ypred) 


plt.figure(figsize=(18, 8)) 

colors=[ “orange”, "green"] 

sns.heatmap(matrix, xticklabels=self.labels, yticklabels<self.labels, cmap-colors, annot=True, fmt<"d") 
plt.title("Confusion matrix”) 

plt.ylabel( ‘Actual’) 

plt.xlabel( ‘Predicted’ ) 

plt.show() 


def draw_anomaly(self, y, error, threshold): 
groupsOF = pd.DataFrame({‘error'’: error, 
‘true’: y}).groupby( true’) 


figure, axes = plt.subplots(figsize=(12, 8)) 


for name, group in groupsOF: 
axes .plot(group.index, group.error, marker='x' if name == 1 else ‘o', linestyle-"', 
color='r‘ if name == 1 else ‘g', label="Anomaly” if name == 1 else “Normal”) 


axes .hlines(threshold, axes.get_xlim()[@], axes.get_xlim()[1], colors="b", zorder-1ee, label="Threshold') 
axes. legend() 


plt.title("Anomalies”) 
plt.ylabel("Eerror”) 
plt.xlabel("Data”™) 
plt.show() 


def draw_error(self, error, threshold): 
plt.plot(error, marker='0", mS=3.5, linestyle="", 
label='Point") 
plt.hlines(threshold, xmin=@, »max=len(error)-1, colors="b", zorder=100@, label="Threshold') 
plt.legend() 
plt.title("Reconstruction error”) 
plt.ylabel("Error”) 
plt.xlabel("Data”") 
plt.show() 


Figure 4-60. Code to visualize the results 


You will use the example of credit card data to detect whether a transaction is 
normal/expected or abnormal/anomaly. Figure 4-61 shows the data being loaded into a 
Pandas dataframe. 


167 


CHAPTER 4 AUTOENCODERS 


filePath = *./creditcardanomalydetection.csv' 
df = pd.read_csv(filepath_or_buffer=filePath, header=0, sep=",') 
print(df.shape[@]) 


df .head() 
284807 
Time vi v2 v3 v4 v5 vé v7 vs ee v2 v22 v23 Vv24 v2: 
0 0.0 -1.350807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.230500 0.008608 0.263787 .. -0.018307 O277838 -0.110474 0.066028 0.12853 
1 0.0 1.101857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803 0.085102 -0.255425 .. -0.225775 -0.638672 0.101288 -0.330846 0.167171 
2 1.0 -1.358354 -1.340163 1.773200 0370780 -0.5031908 15800400 0.701461 0247676 -1.514654 0.247008 0.771679 0.000412 -0.680281 -0.32764: 
3 1.0 0.966272 -0.185226 1.792003 -0.963201 -0.010309 1.247203 0.237609 0.377436 -1.387024 0.108300 0.005274 -0.190321 -1.175575 0.64737" 
4 20 -1.158233 O.877737 1.548718 0.403034 -0.407193 0.005021 0.592041 -0.270533 0.817739 .. -0.000431 0.708278 -0.137458 0.141267 -0.20601% 


5 rows x 31 columns 


Figure 4-61. Code to load the dataset using Pandas 


You will collect 20k normal and 400 abnormal records. You can pick different ratios 
to try, but in general more normal data examples are better because you want to teach 
your autoencoder what normal data looks like. Too much of abnormal data in training 
will train the autoencoder to learn that the anomalies are actually normal, which goes 
against your goal. Figure 4-62 shows the code to take the majority of normal data records 


with a few abnormal records. 


df["Amount'] = Standardscaler().fit_transform(df['Amount"].values.reshape(-1, 1)) 
dfe = df.query('Class <= @').sample(2eeeo) 

dgf1 = df.query('Class == 1°).sample(4e@) 

df = pd.concat((dfe, df1]) 


Figure 4-62. Code to take the majority of normal data records with a few 
abnormal records 


Split the dataframe into training and testing data sets (80-20 split). Figure 4-63 shows 
the code to split the data into train and test subsets. 


x_train, x_test, y_train, y_test = train_test_split(d?.drop(labels=[‘Time’, ‘Class'], axis = 1), 
df[‘Class'), test_size=90.2, random_state=42) 

print(x_train.shape, ‘train samples’) 

print(x_test.shape, ‘test samples*) 


(16326, 29) train samples 
(4086, 29) test samples 


Figure 4-63. Code to split the data into train and test subsets 


168 


CHAPTER 4 AUTOENCODERS 


The biggest difference between the standard autoencoders you have seen so far and 
the variational autoencoder is that here you do not just take the inputs as is; rather, you 
take the distribution of the input data and then sample the distribution. Figure 4-64 
shows the code to implement such a sampling strategy. 


# reporameterization trick 
# instead of sampling from O(z/X), sample epsilon = N(é@,I) 
#2 = Z mean + Sqrt(ver) * epsilon 
def sampling(args): 
“"““Reparameterization trick by sampling from an isotropic unit Gaussian. 
# Arguments 
args (tensor): mean and log of variance of Q(z|X) 
# Returns 
z (tensor): sampled latent vector 


zZ_mean, z_log_var = args 

batch = K.shape(z_mean)[@) 

dim = K.int_shape(z_mean)[1] 

# by default, random_normal has mean = @ and std = 1.8 
epsilon = K.random_normal(shape=(batch, dim)) 

return z_mean + K.exp(e@.5 * z_log var) * epsilon 


Figure 4-64, Code to sample the distributions 


Now it’s time to create a simple neural network model with an encoder and a 
decoder phase. You will encode the 29 columns of the input credit card dataset into 12 
features using the encoder. The encoder uses the special distribution sampling logic 
to generate two parallel layers and then wraps the sampling output (above) as a Layer 
object. 

The decoder phase uses this latent vector and reconstructs the input. While doing 
this, it also measures the error of reconstruction in order to minimize it. Figure 4-65 


shows the code to create the neural network. 


169 


CHAPTER 4 AUTOENCODERS 


original_dim = x_train.shape[1] 
print(original_dim) 


input_shape = (original_dim,) 
intermediate_dim = 12 
batch_size = 32 

latent_dim = 2 

epochs = 26 


# VAE model = encoder + decoder 

# build encoder model 

inputs = Input(shape=-input_shape, name="encoder_input') 
x = Dense(intermediate_dim, activation='relu’)(inputs) 
Z_mean = Dense(latent_dim, name="z_mean")(x) 

z_log_ var = Dense(latent_dim, name="z_log_var')(x) 


# use reparameterization trick to push the sampling out as input 
# note that "“output_shape" isn't necessary with the TensorFlow backend 
z = Lambda(sampling, output_shape=-(latent_dim,), name='z')([z_mean, z_log_var]) 


# instantiate encoder model 
encoder = Model(inputs, [z_mean, z_log_var, z], name='encoder") 
encoder. summary () 


# build decoder model 

latent_inputs = Input(shape=(latent_dim,), name="z_sampling‘) 
x = Dense(intermediate dim, activation='relu')(latent_inputs) 
outputs = Dense(original dim, activation='sigmoid')(x) 


# instantiate decoder model 
decoder = Model(latent_inputs, outputs, name='decoder') 
decoder. summary () 


# instantiate VAE model 
outputs = decoder(encoder(inputs)[2]) 
vee = Model(inputs, outputs, name="vae_mlp") 


# VAE Loss = mse_Loss or xent_Loss + kl_Loss 
reconstruction_loss = mse(inputs, outputs) 


reconstruction_loss *= original_dim 

kl_loss = 1 + z_log_var - K.square(z_mean) - K.exp(z_log_var) 
kl_loss = K.sum(kl_loss, axis=-1) 

kl_loss *= -@.5 

vae_loss = K.mean(reconstruction_loss + kl_loss) 
vae.add_loss(vae_loss) 


Figure 4-65. Code to create the neural network 


Figure 4-66 shows the code to show the neural network. 


170 


CHAPTER 4 AUTOENCODERS 


29 
Layer (type) Output Shape Param # Connected to 
encoder_input (Inputuayer) (None, 29) ®t 
dense_15 (Dense) (None, 12) 366 encoder_input[@][@] 
Zmean (Dense) (None, 2) 26 dense_15(@][9] 
z_log.var (Dense) (None, 2) 26  ©denseas(ej(e) 
z (Lambda) (None, 2) Q zZ_mean[@)[@] 
z_log_var[e][e] 


Total params: 41 
Trainable params: 412 
Non-trainable params: @ 


Layer (type) Output Shape Param # 

2 sampling (Inputtayer) (None, 2) 2=*=*~C<“<iC‘<‘CU”tSSS 
dense_16 (Dense) (None, 12) 36 
dense_17 (Dense) (None, 29) 377 


Total params: 413 
Trainable params: 413 
Non-trainable params: @ 


Figure 4-66. Code to show the neural network 


Compile the model using adam as the optimizer and mean squared error for the 
loss computation. Adam is an optimization algorithm that can be used instead of the 
classical stochastic gradient descent procedure to update network weights iteratively 
based on training data. Figure 4-67 shows the code to compile the model. 


vae.compile(optimizer=‘adam', 
loss="mean_squared_error’, 


metrics=[ ‘accuracy’ ]) 
vae.summary() 
Layer (type) Output Shape 6 .§ FParama# 


encoder_input (InputLayer) (None, 29) 8 
encoder (Model) [(None, 2), (None, 2), (N 412 
decoder (Model) (None, 29) 413 


————— == 


Total params: 825 
Trainable params: 825 
Non-trainable params: © 


Figure 4-67. Code to compile the model 


171 


CHAPTER 4 AUTOENCODERS 


Now, you can start training the model using the training dataset to validate the 
model at every step. Choose 32 as the batchsize and 20 epochs. The training process 
outputs the loss and accuracy as well as the validation loss and validation accuracy at 
each epoch. Figure 4-68 shows the code to train the model. 


history = vae.fit(x_train, x_train, 
batch_size=batch_size, 
epochs=epochs, 
verbose=-1, 
shuffle-True, 


validation data=(x_test, x_test), 
callbacks=<(TensorBoard(log_dir="../logs/variationalautoencoderi*))) 


WARNING: tensorflow: From C:\ProgramData\Anaconda3\lib\site-packages\tensorflow\python\ops \\math_ops .py:3066: to_int32 (from 


rflow.python.ops.math_ops) is deprecated and will be removed in a future version. 


Instructions for updating: 
use tf.cast instead. 


Train on 16326 samples, validate on 4ege samples 


Epoch 1/28 

16320/16320 [sseeeseeeeseessesesereseseses= | 
ri 

Epoch 2/20 

3 

Epoch 3/20 

6 

Epoch 4/26 

9 
Epoch $/20 

i 

Epoch 6/20 

16320/16320 [sseuseecucecessecesssseseszecs | 
| 

Epoch 7/20 

16320/16320 [semmeneeeneeneeseeseeseseesees | 
z 

Epoch 8/20 

1 

Epoch 9/20 

163276/16320 | 
i 

Epoch 1e/20 

@ 

Epoch 11/208 

16326/16326 [seeenecezceeesscescecessceeszs: | 
4 

Epoch 12/20 


16320/16320 [seeeeseeeseeeessessessessesess | 


Epoch 13/20 

16326/16320 [sumeeeeeeceeneececeececcescecs | 
5 

Epoch 14/20 

163276/163270 [=seeewseeersrersrcersersersrsss | 


a 

Epoch 15/20 

Q 

Epoch 16/20 

16320/16320 [sseeesseereeeeseceescssssscesezs | 


4 


= 


35 


35 


35 


35 


35 


35 


35 


35 


35 


35 


35 


35 


35 


35 


35 


35 


195us/step 


1é4us/step 


1é64us/step 


163us/step 


Leous/step 


159us/step 


léius/step 


157us/step 


1é9us/step 


200us/step 


172us/step 


175us/step 


283us/step 


174us/step 


175us/step 


177us/step 


Figure 4-68. Code to train the model 


172 


loss: 


loss: 


loss: 


loss: 


loss: 


loss: 


loss: 


loss: 


loss: 


1055: 


loss: 


loss: 


loss: 


loss: 


loss: 


loss: 


50. 


45 


45 


45 


4s. 


45, 


45 


4s. 


2655 


24 


»8263 


O72 


4739 


3334 


237@ 


»1393 


8787 


8359 


7879 


acc: 


acc: 


ace: 


acc: 


acc: 


acc: 


acc: 


acc: 


Bcc. 


acc: 


acc: 


acc: 


acc: 


acc. 


acc: 


acc: 


»2365 


.2431 


» 2563 


» 2622 


26g 


2729 


2786 


. 2888 


- 2053 


. 3180 


-3119 


»3257 


» 3345 


2461 - 


= 


val_loss: 


val_loss: 


val_loss: 


val_loss: 


val_loss: 


val_loss: 


val_loss: 


val_loss: 


val_loss: 


val_loss: 


val_loss: 


val_loss: 


val_loss: 


val_loss: 


val_loss: 


val_loss: 


49 


6724 


-8029 


- 5003 


3459 


-2021 


48.8794 


ay. 


9675 


47. 


9229 


47 


-8419 


47. 


6565 


47 


-5955 


47. 


5993 


47. 


5176 


47,4447 


47.4417 


47.3702 


val_acc: 


val_acc: 


val_acc: 


wal_acc: 


val_ace: 


val_acc: 


val_ace: 


val_acc: 


val_acc: 


val_acc: 


val_acc: 


val_acc: 


wal_acc: 


val_acc: 


val_acc: 


val_acc: 


tenso 


6.229 


247 


-242 


243 


«255 


-254 


«263 


«271 


- 280 


-287 


294 


289 


8.301 


-326 


+323 


-329 


CHAPTER 4 AUTOENCODERS 


Now that the training process is complete, let’s evaluate the model for loss and 
accuracy. Figure 4-69 shows that the accuracy is 0.23. It also shows the code to evaluate 
the model. 


score = vae.evaluate(x_test, x_test, verbose=1) 
print('Test loss:', score[@]) 
print('Test accuracy:", score[1])| 


4030/4030 | =s==sxxxmssenenceszecsessezeze==) - 65 60u5/Step 


Test loss: 48.38297452739641 
Test accuracy: ©.235294117647085382 


Figure 4-69. Code to evaluate the model 


The next step is to calculate the errors, and detect and also plot the anomalies and 
the errors. Choose a threshold of 10. Figure 4-70 shows the code to predict the anomalies 
based on the threshold. 


threshold=10.90 
y_pred = vae.predict(x_test) 
y_dist = np.linalg.norm(x_test - y_pred, axis=-1) 
Z = zip(y_dist >= threshold, y_dist) 
y_label=[ ] 
error = [] 
for idx, (is_anomaly, y_dist) in enumerate(z): 

if is_anomaly: 

y_label.append(1) 

else: 
y_label.append(@) 
error.append(y_dist) 


Figure 4-70. Code to predict the anomalies based on the threshold 


Compute the AUC (Area Under the Curve 0.0 to 1.0); it comes up as 0.93, which is 
very high. Figure 4-71 shows the code to calculate the AUC. 


roc_auc_score(y_test, y_label) 


@.9345736547003569 


Figure 4-71. Code to calculate AUC 


You can now visualize the confusion matrix to see how well you did with the model. 


Figure 4-72 shows the code to show the confusion matrix. 


173 


CHAPTER 4 AUTOENCODERS 


viz = Visualization() 
viz.draw_confusion_matrix(y_test, y_label) 


Confusion Matrix 





Normal soomaly 
Predxted 


Figure 4-72. Code to show the confusion matrix 


Using the predictions of the labels (normal or anomaly) you can plot the anomalies 
in comparison to the normal data points. Figure 4-73 shows the anomalies relative to the 
threshold. 


174 


CHAPTER 4 AUTOENCODERS 


viz.draw_anomaly(y_test, error, threshold) 





0 50000 100000 1$0000 200000 750000 300000 


Figure 4-73. Showing the anomalies relative to the threshold 


Figure 4-74 shows the graph of the model as visualized by TensorBoard. 





Exp 
. ne yal 
, were. mma . * = : be 
_ _— 7 _ 
P / 4 
ms -t > meg 
= isVariediel_ isVaradiel 
\ 
” ~ , 
} ~ o 


encoder 7 
Cc > suelt-8] 


Figure 4-74. Model graph shown in TensorBoard 


175 


CHAPTER 4 AUTOENCODERS 


Figure 4-75 shows the graph of the model as visualized by TensorBoard. 





i oe + aims x be r —-. 
winzin_ wie. 7} =< a. auseg Gense_ 4 I 


Baad a 


a 
rele TETAS in 


Figure 4-75. Model graph shown in TensorBoard 


Figure 4-76 shows the plotting of the accuracy during the training process through 


the epochs of training. 


Marne Smocthed Value Step Time Relative 





i oO Varsionnuinencoder! §ola2od Oae45 15 Mon Jun 10, 0SES0 ate 


Figure 4-76. Plotting of accuracy shown in TensorBoard 


Figure 4-77 shows the plotting of the loss during the training process through the 


epochs of training. 


176 


CHAPTER 4 AUTOENCODERS 


Smecthed Value Step Time Relative 
8) vemlnnaleuioencmer] 45.19 4514 7 Mon Jun 10, 00:5827 18s 





Figure 4-77. Plotting of loss shown in TensorBoard 


Figure 4-78 shows the plotting of the accuracy of validation during the training 
process through the epochs of training. 


Mame Smoothed Value Step Time Relative 
f © varistionalauioencoder? 0.7164 O22) 13 MonJun 10, 005844 2s 





Figure 4-78. Plotting of validation accuracy shown in TensorBoard 


Figure 4-79 shows the plotting of the loss of validation during the training process 
through the epochs of training. 


177 


CHAPTER 4 AUTOENCODERS 


va lo +> 


Figure 4-79. Plotting of validation loss shown in TensorBoard 


Summary 


In this chapter, we discussed autoencoders, types of autoencoders, and how they can 
be used to build anomaly detection engines. We looked at implementing a simple 
autoencoder and sparse, deep, convolutional, and denoising autoencoders. We also 
explored the variational autoencoder as a means to detect anomalies. 

In the next chapter, we will look at another method of anomaly detection, 
Boltzmann machines. 


178 


CHAPTER 5 


Boltzmann Machines 


In this chapter, you will learn about Boltzmann machines and how the restricted 
Boltzmann machine can be used to perform anomaly detection. 


In a nutshell, the following topics will be covered throughout this chapter: 
e Whatis a Boltzmann machine? 
e Restricted Boltzmann machines (RBMs) 


e RBM applications 


What Is a Boltzmann Machine? 


A Boltzmann machine is a special type of bidirectional neural network comprised 
only of hidden nodes and input nodes, designed to learn the probability distribution 
of a data set. What makes a Boltzmann machine special is that each and every node is 
interconnected to each other, meaning the neurons in the hidden layer are connected 
to each other as well. Additionally, the Boltzmann machine has fixed weights, and the 
nodes make stochastic (probabilistic) decisions about whether or not to fire. 

To better understand the model, let’s take a look at an example in Figure 5-1. 


179 
© Sridhar Alla, Suman Kalyan Adari 2019 


S. Alla and S. K. Adari, Beginning Anomaly Detection Using Python-Based Deep Learning, 
https://doi.org/10.1007/978-1-4842-5177-5_5 


CHAPTER 5 BOLTZMANN MACHINES 


Hidden Nodes 





Wac is the weight 
between nodes B and 


Visible Nodes 


Figure 5-1. A graph showing how a Boltzmann machine can be structured. Notice 
that all of the nodes are interconnected, even if they are in the same layer 


Despite there being a distinction between visible nodes and hidden nodes, that 
doesn’t matter in a Boltzmann machine. In this model, every node communicates to 
every other node, and the entire model works as a system to create a generative network 
(meaning it’s capable of generating its own data based on what it has learned by fitting 
on a data set). In Boltzmann machines, the visible nodes are what we can interact with; 
we can’t interact with the hidden nodes. One more distinction to make is that there is no 
training process; the nodes learn to model the data set as best as they can on their own, 
making the Boltzmann machine an unsupervised deep learning model. 

However, Boltzmann machines aren't necessarily that practical, and they suffer from 
problems when the network is scaled up in size. Specific derivations of the Boltzmann 
machine such as restricted Boltzmann machines (RBM), deep Boltzmann machines 
(DBM), and deep belief networks (DBN) are much more suitable and practical to work 
with, although they are a bit outdated and have no support from the major frameworks 
such as Keras, TensorFlow, and PyTorch. Despite that, they still see some new uses today, 
even though they are overshadowed by newer deep learning models. For our purposes, we 
will look at applying the RBM to anomaly detection, particularly because it is the easiest of 
the three Boltzmann machine derivations to implement and because it is simpler to work 
with when we consider the mathematics (which are still at an advanced level) at play. 


180 


CHAPTER 5 BOLTZMANN MACHINES 


Restricted Boltzmann Machine (RBM) 


The RBM is similar to the Boltzmann machine in that it is an unsupervised, stochastic 
(probabilistic), generative deep learning model. However, a key difference is that 

the RBM is only comprised of two layers: the input layer and the hidden layer. Its 
architecture is similar to that of the artificial neural network model you explored in 
Chapter 3, with the RBM layers looking like the first two layers of an ANN. Because 

we place a restriction on the layers that none of the nodes within their own layer are 
to be interconnected, the model is termed as a restricted Boltzmann machine. More 
specifically, since each node outputs a binary value, we are dealing with a Boolean/ 
Bernoulli RBM. Figure 5-2 shows an RBM. 


Hidden 
Layer 





Figure 5-2. A visual representation of a basic restricted Boltzmann machine 


We can expand this model out even more to include biases (see Figure 5-3). 


181 


CHAPTER 5 BOLTZMANN MACHINES 


Hidden 
Layer 





Figure 5-3. A visual representation of a restricted Boltzmann machine with a 
different bias feeding into each of the two layers 


Bias a adds to all of the outputs of the input layer, and bias b adds to the outputs of 
the hidden layer. From here, we can define what is called the energy function, which the 
RBM tries to minimize. The energy function is shown in Figure 5-4. 


E(v,h) = — » aj;V; — » b,h, _ » v;hj Wij 
j — 


L LJ 


Figure 5-4. A formula that defines the energy function of the restricted Boltzmann 
machine 


The first summation term is an element-wise multiplication between bias a and 
visible layer v, where each term a; is multiplied with each term v;. The second summation 
term follows the same logic, except uses element-wise multiplication with bias b and 
hidden layer h. Finally, the last summation term multiplies each visible node v; with 
each hidden node h, and the weight value w;, for that connection. 

The summations are basically element-wise multiplication between two vectors, one 
being transposed, so 1xn (1 column n rows), and the other being nx1 (n columns 1 
row). When a vector or matrix is transposed, we reverse the dimensions of the vector/ 
matrix and rearrange the values. In a vector, the same values in a row/column are now in 
a column/row. For matrices, it’s a bit more complex. To better understand the concept of 
transposing a vector or matrix, refer to Figure 5-5, Figure 5-6, and Figure 5-7. 


182 


CHAPTER 5 BOLTZMANN MACHINES 


Vector vs. Transposed Vector (Figure 5-5) 


A= [1542] 


ee On 


Figure 5-5. Original vector vs. its transposed version 


Square Matrix vs. Transposed Square Matrix (Figure 5-6) 





Figure 5-6. Original matrix vs. its transposed self. Note how the entries seem to be 
flipped along the diagonal 


Matrix (nxm) vs. Transposed Matrix (mxn) (Figure 5-7) 


(BoE 


Figure 5-7. Original nxm matrix vs. its transposed mxn self. The columns of the 
original matrix C become the rows of the transposed matrix C* 


Rewriting the summations to reflect the multiplication of the respective vectors, one 
being transposed, the energy function is equivalent to the equation in Figure 5-8. 


E(v,h) = -a'v —b'h—-—v'Wh 


Figure 5-8. The equivalent formula for the energy function written without 
summations 


Using the energy function, we can define a probability function that will output the 
probability of the network having a specific (v,h). To elaborate on v and h, vis a vector 
that represents the states of each node in the input layer, and h is a vector that represents 
the states of each node in the hidden layer. 


183 


CHAPTER 5 BOLTZMANN MACHINES 


The probability function is shown in Figure 5-9, given a specific (v,h). 


1 
p(v,h) = Seon) 


Figure 5-9. The probability function that is associated with the visible layer v and 
the hidden layer h 


Z is defined as shown in Figure 5-10. 


L= » e EW h) 


v,h 


Figure 5-10. Z performs the operation over every possible v and h in the data set, 
so you can see how it forms a probability function. (Say you want a probability of 
all hearts in a card deck. This is 13/52, with 13 being all of the hearts and 52 being 
the total number of cards.) 


Z is the sum of the function e~*””) over every single pair of input and hidden layer 
state vectors (a vector representing the states of the layer). The parameters passed into 
p(v,h) are supposed to be vectors representing a specific configuration of the two layers 
in terms of what neurons are activated. 

You can see how this forms a probability function, since we want to find e*”” for 
some v, h over the sum of e~*”” for all possible pairs of v, h. 

We can go a step further and define formulas for the probability of v or h given h or v 
(see Figure 5-11 and Figure 5-12). 


_ phy) 7 
p(hlv) == = | [rcs v) 


Figure 5-11. Formula for the probability of the hidden layer being in the state h 
given the visible layer being in the state v 





_P@h) _ 7 
p(vlh) =o =| [room 


Figure 5-12. Formula for the probability of the hidden layer being in the state v 
given the visible layer being in the state h 


184 


CHAPTER 5 BOLTZMANN MACHINES 


The II works similarly to X, except with multiplication instead of addition. 
Essentially, p(h | v) is the multiplication of every p(h,, v) that exists. In these cases, m is 
the number of hidden nodes, and n is the number of visible nodes. 

This could be a bit complex, so just know that the formulas in Figure 5-11 and 5-12 
are basically to find the probabilities of v or h being in their states given their respective 
h or v layer counterparts. 

From there, we can define two more formulas regarding the probability that a 
particular node v; or h, activates given the vector h or v, respectively (see Figure 5-13 and 
Figure 5-14). 


m 
pV. = 1 ihy= oa + » w; ;h;) 


a 


Figure 5-13. The probability of one particular node v; activating given the 
multiplication of the weights between v; and every single hidden node added with 
the bias 


n 
pth; =1|v)= o(b, + » W; jV; 
i=1 


Figure 5-14. The probability of one particular node h, activating given the 
multiplication of the weights between h; and every single visible node added with 
the bias 


The o represents the sigmoid function, defined by the formula in Figure 5-15. 


@)= aes 
OS Thee ek +1 


Figure 5-15. The formula for a sigmoid function 


Finally, given training inputs, we want to maximize the joint probability of the inputs, 


given by the formula in Figure 5-16. 


arg max | | p(v) 


veV 


Figure 5-16. We are maximizing the joint probability of every possible visible 
node (the inputs) with respect to the weights 


185 


CHAPTER 5 BOLTZMANN MACHINES 


Essentially, we will end up with a huge chain of multiplication of p(v) for every 
possible v given V, the set of all possible training inputs. We take that, and we want to 
maximize that product with respect to the weights W, so we want the weights to be 
increasing the joint probability (that product of all possible v layers). 

We can also rewrite this in terms of maximizing the expected value of the log 
probability as shown in Figure 5-17. 





> oer) 


veV 


arg max E 
W 


Figure 5-17. We take the log of p(v) for some v that’s a part of the whole training set 
V. Then we sum those terms up (think back to the log rules) and find the average of 
them all. That is what we want to maximize with respect to the weights W 


The notation E | |] stands for the expected value. In probability, E(X) is the expected 
value of some random variable X and can be thought of as the mean. In our case, we are 
trying to maximize the mean value of the log probability. Once again, V is the set of all 
training inputs. 

So to explain what the formula means, we use log rules to rewrite the joint 
probability as a summation instead, and then we seek to maximize the average of that 
sum with respect to W, the weights. We want to adjust the weights so that we continue to 
maximize this expected value for every input in the entire training set. 

The formulas pertaining to the RBM can get more complicated and detailed, but the 
ones listed so far should hopefully be enough to help you gain a good understanding 
of what an RBM is and how it works. At its core, the RBM is a probabilistic model that 
operates in accordance with a set of formulas. Additionally, the goal of the formulas is to 
help the RBM learn a probability distribution to represent V, explaining why the RBM is 
an unsupervised learning algorithm. 

As for the training algorithm, there are two choices: contrastive divergence (CD) 
and persistent contrastive divergence (PCD). These algorithms both use Markov 
chains to help the training algorithm determine what direction to perform the gradient 
calculations in, but both differ and have their pros and cons. PCD can get better 
samples of the data and explore the domain of the input space better, but CD is better at 


extracting features. 


186 


CHAPTER 5 BOLTZMANN MACHINES 


Some RBMs might also incorporate a feature known as momentum, which basically 
allows for an increase in learning speed and can be thought of as simulating a ball rolling 
down a hill in terms of optimizing the target function. (Think back to gradient descent 
and how the goal is to get to a local minimum. As the “ball” rolls towards the minimum, 
it gains “Momentum” and descends faster and faster. Once it overshoots, it will gain new 
momentum in the opposite direction, incentivizing it to reach the minimum faster). 

There are more intricacies to the RBM, but in the end, you only need to know that 
RBMs can be used to create a probability distribution of the input data. We will use this 
property of RBMs to single out anomalies by checking the probability of that particular 


sample of occurring. 


Anomaly Detection with the RBM - Credit Card Data Set 


Now that you know more about the complex mechanisms of the RBM, let’s apply the RBM 
to a data set and see how it performs. For your application, let’s use the credit card data set, 
which can be found at www. kaggle.com/mlg-ulb/creditcardfraud/version/3. 

Begin by importing all of your packages. For this application, you will only explore 
how an RBM can be applied to the code, since the source code is quite large. However, 
you can access the source code through the GitHub link at https: //github.com/ 
aaxwaz/Fraud-detection-using-deep- learning. 

Simply download the folder titled rbm and place it in your working directory 
(wherever you have your notebook file or Python file). In this case, we placed in a folder 
named boltzmann_ machines. 

Now, import your modules (see Figure 5-18). 


import pandas as pd 
import tensorflow as tf 
from sklearn.metrics import roc auc score as auc 


import matplotlib.pyplot as plt 


from sklearn.model selection import train test split 


from sklearn.preprocessing import StandardScaler 


from boltzmann machines.rbm import * 





smatplotlib inline 


Figure 5-18. Importing all the modules you need. %matplotlib inline is to save the 
graph within the Jupyter notebook itself 
187 


CHAPTER 5 BOLTZMANN MACHINES 


Next, import the data set. 
Run the following (refer to Figure 5-19 for the output): 


df = pd.read csv("datasets/creditcardfraud/creditcard.csv", sep=",", 
index_col=None, encoding="utf-8-sig") 


In [62]: df = pd. read_csv("datasets/creditcardiraud/creditcard.csv", sep=",", index_col=None, encoding="utf-8-sig") 
df. head (5) 
Gut [62] : 
V5 V6 vT Va vo .. Vii Wiz VW23 Vid V5 V26 V2? V28 Amount Class 
0.333321 0.462388 0.239599 0.098698 O363787 .. -0.018307 O.277838 -0.110474 0.066928 0.128539 -0.189115 0.133558 -0.021053 149.62 0 
O=O8001S -0.082961 -D078803 O095102 -0.255425 O225775 -6SS672 O.101283 -03r0ede CBT 170 0.125895 -0.008893 O.0idred 2.69 0 
0.503198 1800499 O791461 O247676 -1514654 .. O247998 OF7TIG79 0908412 -0.680281 -0.327642 -0.130097 -0.055353 -0.059752 378.66 Oo 
0.010309 L2d7203 O2397609 O37F436 -1 38704 “0.108300 O.005274 -0.790321 -1.775575 O<.647976 0221929 O.062723 0.061458 6123.50 i] 
0.407193 0.095921 05927941 -0270533 O817739 .. -0.009431 O798278 0.197458 O.141267 1.206010 O502292 O219422 0215153 69.99 0 


Figure 5-19. Visualizing the data set you just loaded. This figure is scrolled right to 
show the classes 


Looking at the data, it seems that the values in the columns Amount and especially 
Time need to be normalized. Take a look at how large the values for time get 
(see Figure 5-20). 


In [3] df.tail() 
Out [3] 
Time v1 v2 v3 v4 
284802 172786.0 -11.881118 10.071785 -9.834783 -2.066656 
284803 172787.0 -0.732789 -0.055080 2.035030 -0.738589 
284804 172788.0 1.919565 -0.301254 -3.249640 -0.557828 
284805 172788.0 -0.240440 0.530483 0.702510 0.689799 
284806 172792.0 -0.533413 -0.189733 0.703337 -0.506271 


2 rows * 31 columns 


Figure 5-20. Looking at the tail end of the data frame (bottom five entries), the 

values for time clearly become massive. You must address this in order to train the 
RBM and ensure that the training process goes smoothly and works properly. Large 
values like this can ruin the whole process and even lead to no convergence 


188 


CHAPTER 5 BOLTZMANN MACHINES 


To avoid numbers like these from potentially ruining the training process, you should 
standardize the values for both columns. Everything else seems to already be standardized, 
so you should only worry about these columns. Run the code in Figure 5-21. 


df['Amount'] = 
stvandardscaler().f1it transtorm(dt["Amounc” |evalues,reshape(—1;, .1):) 


df['Time'] = 
standardscaléer().f£20 transform(dt| Time’ |.valucs.reshaps (-l, 1)) 





Figure 5-21. Standardizing the values in the columns Amount and Time 


Now let’s take a look at the values to see how they were transformed (see Figure 5-22 
and Figure 5-23). 


In [5]: df. head () 


WS Ve Vi We Wo... V2 V22 W23 Wid W25 V26 Ver W28 Amount Class 


338321 O462388 O.279599 O.098698 O=263787 . -O.018307 O2T7TE38 -0.110474 O.086928 O.128539 -0.189115 O.13358 -0.021053 0.244004 0 


O800TE -D.082361 -0TEEOS O.0BS102 -025e5 8S “63862 OLIOTEE -0.599R468 DET TTO (OF12589s “0.008993 D024 -Dae2ays 0 
503198 1.800499 O791461 O247676 -1.514654 O247998 OFS O.909412 -0.689281 -O22Th42 -0.139097 -0.055253 -0.089782 1.160686 it 
010309 1247203 O2397609 O.47436 -1.387024 “0.108300 O.008274 -0.9790327 -19775575 OF=<£6473768 -0.221929 OF£062723 0061458 «(0.140534 o 
407193 0.095921 0.592941 -1.270533 O§=817739 . -0.009431 O7S8278 -0.137458 O.141267 -0.206010 O50e792 O218422 0.215153 -.07T3403 0 


Figure 5-22. Looking at the values for the column Amount to see how they were 
standardized 


In [6]: df.tail ()} 


Out [6]: 
Time V1 V2 V3 V4 V5 








284802 1.641931 -11.881118 10.071785 -9.834783 -2.086656 -6.364473 
284803 1.641952 -0.732789 -0.055080 2035030 -0.738589 0.868229 
284804 1.641974 1.919565 -0.301254 -3.249640 -0.557828 2.630515 
284805 1.641974 -0.240440 0530483 0.702510 0.689799 -0.377961 
284806 1.642058 -0.533413 -0.189733 0.703337 -0.506271 -0.012546 


2 frows = 317 columns 


Figure 5-23. Looking at the values for the column Time to see how they were 
standardized 


189 


CHAPTER 5 BOLTZMANN MACHINES 


Awesome; looking much better. Now, you can define your training and testing data 
sets (see Figure 5-24). 


yan Sei aig di,aloclsZ00000, L:-2] <velues 


y train df.i1Loc([s200000, =-Ll.values 


x test diet Loc|Z00000c, Lr=2).~values 


Y test df.iloc[200000:,-1].values 


print ("Shapes:\nx train:%s\ny train:%s\n" 
(x train.shape, y train.shape) ) 


oO 


print ("x test:%3s\ny test:%s\n" % 
(x. LSesl2enape, Y Test: snepe)) 





Figure 5-24, This is a different process than usual because of how the RBM model 
expects the input 


You should see something like Figure 5-25 as the output. 


In [71l): 
r = df.iloc[:200000, 1:-2].values 
y train = df.iloc(:200000, -1].values 


= df.iloc([2Z00000:, 1:-2].values 
y test = df.iloc(200000:,-1).values 


print ("Shapes:\nx_train:%s\ny train:%s\n" % (x_train.shape, y train. shape) ) 
print("x test:ts\ny test:%s\n" &% (x test.shape, y test.shape) ) 


= 


Shapes: 
x train: (200000, 28) 
y_ train: (200000, ) 


x test: (84807, 28) 
y_test: (84807,) 


Figure 5-25. The output shapes of the training and testing sets 


190 


CHAPTER 5 BOLTZMANN MACHINES 


Getting to the model itself, use the code in Figure 5-26. 


model: = RBM(x train.shape[1], 10, wisable unit. type="gauss', 
main dir='./"', model name='rbm model.ckpt', 


gibbs sampling steps=4, learning rate=0.001, momentum = 0.95, 
batch sizée=512, num epochs=20, verbose= 1) 





Figure 5-26. Initializing the model with a set of parameters 


The parameters are as follows: 


num_visible: The number of nodes in the visible layer 
num_hidden: The number of nodes in the hidden layer 
visible_unit_type: If the visible units are of type binary or gauss 


main_dir: The main directory where to put the models and the 
directories for data and summary 


model_name: The name of the model used when saving 
gibbs_sampling_ steps: (Optional) Default is 1. 


learning_rate: (Optional) Set to the default value of 0.01. Specifies 
the learning rate. 


momentum: The value for momentum to use in gradient descent. 
Default is 0.9. 


12: The 12-weight decay. Default is 0.001. 
batch_size: (Optional) Default is 10. 
num_epochs: (Optional) Default is 10. 


stddev: (Optional) Default is 0.1. Ignored if the visible_unit_type is 
not gauss. 


verbose: (Optional) Default is 0. A value of 1 shows the outputs, and 
0 shows nothing. 


plot_training_loss: Whether or not to plot the training loss. Default is 
True. 


191 


CHAPTER 5 BOLTZMANN MACHINES 


Now you can fit the data to the model. Run the following (refer to Figure 5-27 for the 


output): 


model.fit(x train, validation set=x_ test) 


In [9]: 


Figure 5-27. The output of training the model 


Now that you finished training, you can look at evaluating your model. To get the 
probability values for each entry in the test set, you have to calculate the free energy for 
each data point (this is a function unique to this version of the RBM). From there, you 
can get the probability of each data point occurring given its free energy. Run the code in 


Figure 5-28. 


192 


1 model.fit(x train, validation_set=x_test) 


Validation 
Validation 
Validation 
Validation 
Validation 
Validation 
Validation 
Validation 
Validation 
Validation 
Validation 
Validation 
Validation 
Validation 
Validation 
Validation 
Validation 
Validation 
Validation 
Validation 


Reconstruction error 


0 


cost 
cost 
cost 
cost 
cost 
cost 
cost 
cost 
cost 
cost 
cost 
cost 
cost 
cost 
cost 
cost 
cost 
cost 
cost 
cost 


Training batch losses v.s. iteractions 


at 
at 
at 
at 
at 
at 
at 
at 
at 
at 
at 
at 
at 
at 
at 
at 
at 
at 
at 
at 


step 
step 
step 
step 
step 
step 
step 
step 
step 
step 
step 
step 
step 
step 
step 
step 
step 
step 
step 
step 


no I nub whe © 


ui 


10: 
11: 
12: 
13: 
14: 
15: 
16: 
17: 
18: 
19: 


0.9633683 
.88812894 
-8710143 
»§581612 
.8551915 
.84902173 
-85834426 
»65502 
.8582019 
.85367733 
- 8532354 
»8550585 
-85659426 
- 8475488 
6495585 
-8487246 
»-85202855 
-8558134 
~8530945 
-8489576 


ooo oOo oO oa oO & 


ooo poe 8 oo 8 





1000 2000 3000 4000 5000 6000 7000 8000 
Num of training iteractions 


CHAPTER 5 BOLTZMANN MACHINES 


costs = model .qeLrrecinergy (x Lest).resnape (=) 


Score = aucty Test, Costs) 


pranc ("AUC Score: (~:.2¢)".~format (score) ) 





Figure 5-28. Code to get the costs from the test set and get the AUC scores from that 


The output should be something like Figure 5-29. 


In [10]: costs = model.getFreeEnergy(x test) .reshape (-1) 


score = auc(y test, costs) 
print("AUC Score: {:.2%}".format (score) ) 


INFO: tensorflow:Restoring parameters from ./rbm model.ckpt 
AUC Score: 95.84% 


Figure 5-29. The AUC score ended up at 95.84% 


Considering the seemingly simple architecture of the RBM (with how few nodes 
there are in the model compared to neural networks), that’s a pretty good AUC score! 

You can also graph the free energy vs. the probability of each data point to get an 
idea of what the anomalies look like compared to the normal data points. Before you do 
that, let’s check a five-number summary of each data set to get a sense of how they are 
distributed. 


Figure 5-30 shows the code for the five-number summary of the normal data. 


normal = pd. Davakrame(cosisly test—=)]) 


normal.describe() 





Figure 5-30. Code to check the five-number summary of the normal data 


The output should look somewhat like Figure 5-31. 


193 


CHAPTER 5 BOLTZMANN MACHINES 


In [25]: normal = pd.DataFrame(costs[y test==0]) 
normal.describe () 


Out [25]: 


. 


0 

count 84700.000000 
mean 0.760787 
std 87.097176 


min -7.088358 
25% -5.302821 
50% -4.028915 
75% -1.369422 


max 21804.019531 


Figure 5-31. The five-number summary shows that the normal data is right 
skewed, since the values for each quartile are in the negative, while the outlier 
values in the tail bring the mean up into the positives 


Now let’s check the five-number summary of the anomalies (see Figure 5-32). 


anomaly = pd.DataFrame(costs|[y test==1]) 


anomaly.describe () 





Figure 5-32. The code to check the five-number summary for the anomalies 


The output should look somewhat like Figure 5-33. 


In [26]: anomaly = pd.DataFrame(costs[y test==1)) 
anomaly.describe |) 


= om | 


Out(26]: 


0 

count 107.000000 
mean 88.472694 
std 64.513130 
min -6.289360 
25%  36.866241 
50%  98.163078 
75% 128.187202 
max 231.617798 


Figure 5-33. Looking at the data, it seems that all of the anomalies are below 250. 
Knowing this, you can now pick a threshold value so only the relevant data is displayed 
on the graph 


194 


CHAPTER 5 BOLTZMANN MACHINES 


Knowing the general distribution of the data, you can pick a threshold value so that 
only relevant data is shown on the graph. You know the majority of the normal data is 
situated around the value zero, so the outliers are irrelevant to you since they won't show 
up on the graph anyways (a few values for 20,000 won’t show up when compared to tens 
of thousands of values around zero). 

And so let’s choose a cutoff point of 250, since the maximum free energy for an 
anomaly is at around 232. Figure 5-34 shows a graph of the free energy vs. the probabilities 
for the test set. 


plt.title('Free Energy vs Probabilities for Test Set') 
plt.figure(figsize=(15, 10)) 

plt.xlabel ('Free Energy') 

plt.ylabel ('Probabilty') 


plt.hist(costs[(y_ test == 0) & (costs < 250)], bins 
color='green', normed=1.0, label='Normal') 


PlLUshist (Costs (y tes: == 1) & (costs ~ 250), bins 
color='red', normed=1.0, label ='Anomaly') 


plt.legend(loc="upper right") 


plt.show() 





Figure 5-34. Code to plot the free energies associated with x_test and the 
respective probabilities 


Figure 5-35 shows the code. 


pit.title('Pree Energy vs Probabilities for Test Set’) 
plt.figure(figsize=(15, 10)) 

plt.xlabel(*Pree Energy’) 

plt.ylabel (‘Probabilty’) 


plt.hist(costs[(y test == 0) & (costs < 250)), bins = 100, color="green', normed=1.0, label="Normal’) 
plt.hist(costs[(y test =™ 1) & (costs < 250)], bins = 100, color™*red', normed™1.0, label ™*Anomaly') 
pit.legend(loc®"upper right”) 

pit. show () 


Figure 5-35. The code to graph the free energies of the data points and their 
probabilities 


195 


CHAPTER 5 BOLTZMANN MACHINES 


The output graph is shown in Figure 5-36. 


Ba Horrnal 
2 Anomaly 








Probabalty 





0 5p 100 150 200 50 
Free Energy 


Figure 5-36. The graph of the free energies vs. the probability of the normal and 
anomaly data points in the test set with costs less than 500 


The graph automatically graphs the probabilities of the data points based on 
their free energies, but this isn’t exactly made very clear for you to see. The way the 
probabilities are computed correspond with this line of code: 


probs = costs / np.sum(costs) 


This essentially takes the individual free energy and divides it by the total free energy 
associated with the whole set. 

The RBM seems to have learned the distribution well enough that you can see a 
pretty clear separation between the normal values and the anomalies, although there is 
a bit of an overlap. In any case, the RBM performed pretty well on the credit card dataset 
with an AUC of 95.84%. 


196 


CHAPTER 5 BOLTZMANN MACHINES 


Anomaly Detection with the RBM - KDDCUP Data Set 


Remember the KDDCUP data set you looked at in Chapter 2? Let’s try to apply the RBM 
to it as well. The application will be a similar procedure to that in the previous example, 
but instead of dealing with excessively large values in the data set, you will learn how to 
deal with data that is comprised of a hefty number of zero entries. 


Again, you begin by importing all of the necessary modules (see Figure 5-37). 


import pandas as pd 
import tensorflow as tf 
from sklearn-metrics Amport roc auc score as:.auc 


import matplotlib.pyplot as plt 


from sklearn.model selection import train test split 


from sklearn.preprocessing import StandardsScaler 
from boltzmann machines.rbm import * 


from sklearn.preprocessing import LabelEncoder 


smatplotlib inline 





Figure 5-37. Importing the necessary modules 


Next, you need to import your data set. Since you've used it before, you don’t have to 
do df .head() or print out the shape, but it still helps to get a sense of what the data set 
looks like (see Figure 5-38). 


197 


CHAPTER 5 BOLTZMANN MACHINES 


columns = |"Guration", “protocol type", “service”, “Tlag", “sre bytes”, 
"dst bytes", "land", "wrong fragment", "urgent", 


"hou", “num Terbed logins”, “logged 1n", "num compromised”, 
"Toot. shell", “su-ealcemptlca”", “num Pool”, 


"IMM, Fale creations’, “num shells", “num access Tiles”, 
"num outbound cmds", "is host login", 


“Ls Gusest. Jogin", “counu", "Srv count"; "Serror rete”, 
"SEV GELrLror tate", “rerror rate”, “Srv.rerror rate", 


“Same GEV race", “Clit sryv-rate", “ery Oate Host: tate”, 
"OSE NOSt. Count", “GSt lost sry counu”, 


"St _NOst Sane Siy rave’, “dsu host diitl sity fete”, 
"Gs DOSt. Seme @rc Port. rate”, “dst NOSt Srv Gift nost fate”, 


"dst. NOS Se6rror. rate", “Gst NoOst.esrv ‘serror rate”, 
"OSt:. NOSt: rerror rate”, “dst. Most srv frerror rate”, “Label” |] 


df = 
pd.read csv("datasets/kdd cup 1999/kddcup.data/kddcup.data.corrected", 
sep=",", names=columns, index col=None) 


print (df.shape) 


df.head() 





Figure 5-38. Defining the columns and loading the data set 


The output is shown in Figure 5-39. 


In [321]: columns = ["duration", "protocol type", "service", "Elag", "sre_bytes™, "dat bytes", "land", "wrong Fragment”, "urgent", 
"hot", “num failed logins", “logged in", “num compromised", "“roct_ shell", "“su_attempted", “num rect", 

"num File creations", "num shells", “num access Files", “num outbound cmds=", "is host login", 

"is _guest_login", "count", “"srv_count”, "serror_rate", "srv_serror_rate™, “rerror_rate", "srv_rerror_rate”, 

"same sry rate", “diff arv rate", "“arv diff host rate", “dst host count", “dat host arv count", 

“det_host_same_srv_rate™, "dst_host_ diff srv_rate", “dst_host_same_ src_port_rate", “dst_host_srv_diff host_rate”, 

"det host serror rate", "dst host arv_serror_ rate", "dst host rerror rate", "dst host _asrv_rerror_rate", "label”] 


df = pd. read _cav("datasets/kdd cup_1999/kddcup.data/kddcup.data.corrected", sep=",", names=columns, index_col=None) 
print (df. shape) 
gf head (|) 


(4898431, 42) 


duration protocoltype service flag src_bytes dst_bytes land wrong_fragment urgent hot . dst_host_srv_count dst_host_same_srv_rate dst_host_di 


0 0 ep hip = SF 215 45076 0 i] o oO 0 o.0 

1 Oo tep hip 8 SF hz a520 0 0 o oO 1 1.0 

Z 0 tep http SF 236 1228 0 0 o oO z 1.0 

3 i] tep itp = SF Zag 2032 0 0 o 6D 3 1.0 

4 0 tep hitp )=6oSF 239 456 0 0 0.60 a 1.0 
5 Fows = 42 Columns 
< 2 


Figure 5-39. Notice that there are categorical labels to deal with, and that there 
are a huge number of columns per data entry 


198 


CHAPTER 5 BOLTZMANN MACHINES 


As in Chapter 2, you only want to focus on HTTP attacks, so let's filter the data frame 
to only include them (see Figure 5-40). 


at df[df["service"] == "http"] 


ai df.drop("service", axis=1) 


columns.remove ("service") 


print (df.shape) 


df .tail() 





Figure 5-40. Filtering all the entries to include only HTTP attacks and dropping 
the service column from the data frame 


The new output is shown in Figure 5-41. 


In [358]: gf = di[di["service"] == "http"™) 
df= = dft.drop("service", axis=1) 
columns. remove ("service") 


print (df. shape) 
af.teil {) 


(623091, 41) 


duration protocol_type flag src_bytes dst_bytes land wrong_fragment urgent hot num_failed_logins .. dst_host_srv_count dst_host_same_srv_ 


Ad03426 0 kp SF 212 2288 0 H 0 oO Be oa 255 

d20R427 0 kp 86 SF 219 236 0 0 o oO 0 255 

4208428 0 to «6s SF 218 3610 0 0 o oO OP. 255 

408425 0 tp = SF 219 1234 i] 0 0 0 0 255 

4298470 it tp SF 219 1038 i] 0 0 o O 255 

5 ows = 47 columns 

< > 


Figure 5-41. The columns only consist of HTTP attacks. Here you look at the tail 
end of the data frame 


As a reminder, df.tail() performs the same function as df.head() but shows the 
entries from the bottom up as opposed to top down. Also, you can pass a parameter in 
the parenthesis to indicate the number of rows you want to see. 

You don’t want values that are strings in your data, so you have to use the label 


encoder as in Chapter 2 (see Figure 5-42). 


199 


CHAPTER 5 BOLTZMANN MACHINES 


for col ain dfi.columns: 


1f df[col].dtype == "object": 


encoded = LabelEncoder () 
encoded. fit(df[col] ) 


df[col] = encoded.transform(df[col]) 


dL.nead () 





Figure 5-42. Using the label encoder on the categorical values in your data frame 


The new output is shown in Figure 5-43. 


In [359]: for col in df.columns: 
if df[col].dtype == “object”: 
encoded = LabelEncoder() 
encoded. fit (dé[col]) 
df({col] = encoded.transform(df[col)) 


df. head () 


resctn. 
; wi: 


duration protocol_type flag src_bytes dst_bytes land wrong_fragment urgent hot num_failed_jogins . dst_host_srv_count dst_host_same_srv_rate d 
0 0 0 9 215 45076 0 0 0 0 0 0 0.0 
1 0 0 q 162 4528 0 0 0 0 0 1 10 
2 0 0 i] 236 1228 0 0 0 0 0 2 10 
3 0 0 9 233 2032 0 0 0 0 0 3 1.0 
4 0 0 9 239 486 0 0 0 0 0 a 1.0 


5 rows x 41 columns 


< > 


Figure 5-43. The output showing the new data frame with the categorical values 
converted to integer label equivalents 


In this data set, the normal data entries comprise an overwhelmingly large 
proportion of the data entries, pretty much drowning out the anomalous data. Not only 
that, but you don’t want to pass in all of the data values into the RBM, so you will create 
a new data frame that contains a portion of normal data entries and all of the anomalous 


data entries. Run the code in Figure 5-44. 


200 


CHAPTER 5 BOLTZMANN MACHINES 


anomalies = df[df["label"] != 4] 


normal = dftldt["“label"] == 4] 


for f in range(0, 10): 


normal = normal.iloc[np.random. permutation (len(normal) ) ] 


novelties = pd.concat([normal[:50000], anomalies] ) 


novelties.shape 





Figure 5-44, Code to define an anomaly data set and a normal data set. Then, the 
normal data set is shuffled to ensure random selections, and a new data set named 
novelties is formed 


As in Chapter 2, the normal labels are encoded as 4 so you can use them as the basis 
to separate the normal entries from the anomalies. 

Since the data set is so large, the entries are shuffled randomly ten times before a sample of 
50,000 is selected from them. This is to ensure a random selection of values from the entire 
data set instead of having the entries just in the top 50,000. The output is shown in Figure 5-45. 


In [360]: anomalies = df[df["label"] != 4] 
normal = df[df["label"] == 4] 


for £ in range(0, 10): 
normal = normal.iloc[np.random.permutation(len(normal) ) ] 


novelties = pd.concat([normal[:50000], anomalies)) 
novelties.shape 


* 


ut [360]: (54045, 41) 


Figure 5-45. The output of the code in Figure 5-44 


One thing about the KDDCUP data set is that there are a massive amount of entries 
with data values as either miniscule values or as 0. You've dealt with massive values 
with the credit card data set, and you know that those values can throw off the training 
process entirely. Likewise, massive amounts of zero values or really tiny data values can 


also hamper the training process. 


201 


CHAPTER 5 BOLTZMANN MACHINES 


Since novelties.head() only displays some of the columns, you'll have to use 
something else to check every column, so look at the code in Figure 5-46. 


Sy 


with pd.option context ('display.max rows', 
'display.max columns', 


Al): 


print (novelties) 





Figure 5-46. Code to print all the columns and five rows in the data frame 


The parameters are self-explanatory. In the example, all 41 columns are displayed for 
the first 5 rows (Figure 5-47 and Figure 5-48). 


In [367 with pd.option_context ("display.max rows’, 3, ‘display.max_columns’, 41): 
print (novelties) 
duration protecol_type flag srec_bytes dst_bytes land \ 
1040102 0 0 9 198 27266 0 
793833 0 0 9 227 345 0 
4764841 0 0 9 54540 8314 0 
4764842 0 0 9 54540 8314 0 
wrong fragment urgent hot num failed logins logged in \ 
7193833 0 0 0 0 1 
476484 0 0 2 0 1 
4764842 0 2 0 1 
num compromised root_shell su_attempted num root \ 
1040102 0 0 0 0 
793833 0 0 0 0 
4764841 1 0 0 0 
4764842 1 0 0 0 
num file creations num_shells num access files num outbound cmds 
1040102 0 0 0 0 
793833 0 0 0 0 
4764841 0 0 0 0 
4764842 0 0 0 0 
is host login is_quest_login count srv_count serror_rate 
1040102 0 0 3 3 0.0 
793833 0 0 12 24 0.0 
4764841 0 3 3 0.0 
4764842 0 0 3 3 0.0 
Srv_serror rate rerror_ rate srv_rerror_rate same srv_rate 
1040102 0.0 0.0 0.0 1.0 
793833 0.0 0.0 0.0 1.0 
4764841 0.0 0.0 0.0 1.0 
4764842 0.0 0.0 0.0 1.0 
Figure 5-47. The output from the code in Figure 5-46. Notice the massive amount 


of zero values in the columns of the data entries 


202 


CHAPTER 5 BOLTZMANN MACHINES 


diff srv_rate srv diff host rate dst host count \ 


1040102 0.0 0.60 84 
793833 0.0 0.12 253 
4764641 0.0 0.00 99 
4764642 0.0 0.00 100 


dst host srv_count dst host same srv_rate dst host diff srv_rate \ 


1040102 be be 1.0 0.0 
793833 2Ja 1.0 0.0 
4764841 99 1.0 0.0 
4764642 100 1.0 0.0 


dst host same src port rate dst host srv diff host rate \ 


1040102 0.01 0.03 
793833 0.00 0.00 
4764841 0.01 0.00 
4764842 0.01 0.00 


dst host serror_ rate dst_ host _srv_serror_rate dst host rerror_rate \ 


1040102 0.01 0.00 0.01 
793833 0.00 0.00 0.00 
4764641 0.01 0.01 0.01 
4764842 0.01 0.01 0.01 


dst host srv_rerror rate label 


1040102 0.01 4 
793833 0.00 4 
4764841 0.01 0 
4764842 0.01 0 


(54045 rows x 41 columns] 


Figure 5-48. The rest of the output continued from Figure 5-46. There are still 
many zero values or really small values in each entry 


While the large amount of zero-value entries might not have affected the isolation 
forest, they will certainly mess with the training process of the RBM, leading to terrible 
AUC scores. Therefore, standardizing all of the values will help the RBM during the 
training process and help it attain proper AUC scores. 

You don’t want to standardize the data values for the columns protocol_type, flag, or 
label, so exclude them specifically (see Figure 5-49). 


203 


CHAPTER 5 BOLTZMANN MACHINES 


for c in columns: 


ZE(C 2= “protocol type” and. ¢ t= “flag” and: ¢ t= “lebel”)< 


novelties[c] = 
PtLandardocaler().fit- Lransiorm(nevellies(C)«values reshape (1, 


ty) 


novelties.head() 





Figure 5-49. Standardizing every value except for the columns the label encoder 
transformed 


The output showing the standardized data is shown in Figure 5-50, Figure 5-51, and 
Figure 5-52. 


In [346]: for c in columns: 
if(c != "protocol _type” and c != "flag" and c != "label”): 


novelties[c] = StandardScaler().fit_transform(novelties([c].values.reshape(-1, 1)) 


novelties.head() 


Figure 5-50. The code in a Jupyter cell 


“a duration protocol_type flag src_bytes dst_bytes land wrong_fragment urgent hot num_failed_logins .. dst_host_srv_count dst_host_sas 
197882 -0.007301 0 9 -0.199908 -0.179237 0.0 0.0 0.0 -0.209332 oo”. 0.329176 | 
369876 -0.007301 0 9 -0.205336 -0.184283 0.0 0.0 0.0 -0.209332 0.0 0.329176 
336092 -0.007301 0 9 -0205336 -0.146532 0.0 0.0 0.0 0.209332 00. 0.329176 
4789776 -0.007301 0 9 -0205336 -0.159467 0.0 0.0 0.0 -0.209332 0.0 0.329176 
758885 -0.007301 0 9 -0.204213 -0.179466 0.0 0.0 0.0 -0.209332 0.0 . 0.329176 
5 rows « 41 columns 
< > 


Figure 5-51. The first part of the output showing that most of the values have been 
transformed 


mut [346] = 
= dst_host_diff_smv_rate dst_host_same_src_port_rate dst_host_sm_diff_host_rate dst_host_serror_rate dst_host_srv_serrorrate dst_host_remorrate dst_host_: 
0.124983 0.986373 0.003223 0.181091 -0.179287 0.322855 
“0.124983 -0.397602 0.589299 -0.181091 -0.179287 0.322855 
“0.124983 “0.391602 “0.391791 0.181094 0.179287 0.322855 
“0.124983 0.336463 0.200731 ~0.181097 -0.17 9287 “0.286223 
“0.124983 -0.115907 0.194284 -0.181091 “0.179287 0.322855 
¢ > 


Figure 5-52. The same output but scrolled right to show that more of the values 
have been transformed 


204 


CHAPTER 5 BOLTZMANN MACHINES 


As you can see, most of the zero value entries have been standardized in accordance 
with all of the values in their respective columns. The few nonzero entries in these 
columns will help the scaler to standardize the rest of the values in that column. 

Just as you want to avoid massive values in the training set, you also seek to avoid 
large amounts of zero value entries in the data. In both such cases, the calculations 
for the gradient will be thrown off, resulting in cases such as the “exploding gradient” 
(gradients so big that the model can never converge on the local minimum) or the 
“vanishing gradient” (gradients so small that they are practically nonexistent, and 
the model never converges on the local minimum). An abundance of values that are 
too large or too small can negatively affect the training process, so it’s a good idea to 
preprocess the data set before training the model on it. 


Now you can move on to defining your training and testing sets (see Figure 5-53). 


novelties.iloc[:43000, 1:-2].values 


novelties.iloc[:43000, -1].values 


novelties.iloc[43000:, 1:-2].values 


novelties.iloc[43000:,-1].values 


print ("Shapes: \nx train:ss\ny train:os\n" < (x Crain.Shape, 
y train.shape) ) 


Ppranc("x té6stt2s\ny Testi<s\n" & (* Test. shape, y test.shape) ) 


y CeEst 





Figure 5-53. Defining the training and testing sets and printing out the shapes of each 


The corresponding output is shown in Figure 5-54. 


205 


CHAPTER 5 BOLTZMANN MACHINES 


novelties.iloc[:43000, 1:-2].values 
novelties.iloc[:43000, -1].values 


In [14]: x train 


y_train 


x test = novelties.iloc[43000:, 1:-2]).values 
y_ test = novelties.iloc[43000:,-1).values 


print ("Shapes:\nx train:%s\ny train:%s\n" % (x_train.shape, y train.shape)) 
print ("x_test:%s\ny test:%s\n" % (x_test.shape, y test.shape) ) 


y_test 
Shapes: 


x train: (43000, 36) 
y_train: (43000, ) 


x test: (11045, 36) 
y_test: (11045, ) 


Out[i4): array([4, 4, 4, ..., 0, 0, 0), dtype=inté4) 
Figure 5-54. The output shapes and some entries of y_test are displayed 


The 43,000 entries indicate a roughly 80-20 split between the training and testing 
data sets. 

Again, you drop the last column, since this is unsupervised training (although it is 
true that both the anomalies and the normal entries are labeled, the model only sees 
unlabeled data during the training and prediction processes). 

With your data sets created, you can define and train the model (see Figure 5-55, 
Figure 5-56, and Figure 5-57). 


model = KRBM(x train.shape[l], 20, visible unltt type="Ggauss", 
main dir='./', model name='rbm model2.ckpt', 


gibbs sampling steps=4, learning rate=0.001, 
momentum =— 0.95, batch saze=512, num: epechs=20, verbose=1) 





Figure 5-55. Initializing the model 


The code to train the model is shown in Figure 5-56. 


model.fit(x train, validation set=x test) 


Figure 5-56. Training the model on x_train, using x_test as validation data 


206 


CHAPTER 5 BOLTZMANN MACHINES 


The output you should see is shown in Figure 5-57. 


In [17]: i model.fit(x_train, validation_set=x_test) 


Validation cost at step 0: 1.4761298 
Validation cost at step 1: 1.4645535 
Validation cost at step 2: 1.4219579 
Validation cost at step 3: 1.4172356 
Validation cost at step 4: 1.42167 
Validation cost at step 5: 1.41365312 
Validation cost at step 6: 1.418542 
Validation cost at step 7: 1.3989593 
Validation cost at step 8: 1.4185325 
Validation cost at step 9: 1.4090425 
Validation cost at step 10: 1.4065987 
Validation cost at step 11: 1.4020221 
Validation cost at step 12: 1.4002018 
Validation cost at step 13: 1.4049628 
Validation cost at step 14: 1.4142944 
Validation cost at step 15: 1.4096367 
Validation cost at step 16: 1.3955325 
Validation cost at step 17: 1.4014637 
Validation cost at step 18: 1.3970937 
Validation cost at step 19: 1.3985484 


Training batch losses v.s. iteractions 


Reconstruction error 





0 20 500 50 1000 150 i500 1730 
Num of training iteractions 


Figure 5-57. The training output by the model for the code in Figure 5-56 


Since the labels aren’t binary, you want to redefine them as either normal, 0, or 
anomalous, 1. Run the code in Figure 5-58. 


207 


CHAPTER 5 BOLTZMANN MACHINES 


for f an tange(0, lenty test) ): 


if y test[f] == 4: 
y tester)! = 0 
else: 


y test] 





Figure 5-58. Code to change all labels that are 4 to 0, representing normal entries, 
and all labels that aren’t 4 to 1, representing anomalies 


The output you should see is shown in Figure 5-59. 


In [353]: for f in range(0, len(y test)): 
if y test(f) = 4: 
y_test(f) = 0 
else: 
y test(f) = 1 
y_ test 
Out [353]: array(({0, 0, 0, ..., 1, 1, 1), dtype=inté4) 


Figure 5-59. The labels should now be transformed. Some of the entries in y_ test 
are shown to make sure they were transformed correctly 


Now that your labels have been corrected, you can get the free energy and find the 
AUC score (see Figure 5-60). 


Gcsts = MocelvwgelPrcemnergy (xX test) reshape (—1) 


SCOTS = aucty test, Costs) 


print ("AUC Score: {:.2%5}".format (score) ) 





Figure 5-60. Code to get the free energy for each model in x_test and then to find 
the AUC score based on that 


208 


CHAPTER 5 BOLTZMANN MACHINES 


The output you should see is shown in Figure 5-61. 


In [19]: costs = model.getFreeEnergy(x test) .reshape(-1) 
score = auc(y test, costs) 
print ("AUC Score: {:.2%}".format (score) ) 


INFO: tensorflow:Restoring parameters from ./rbm modelz.ckpt 
AUC Score: 99.46% 


Figure 5-61. The generated AUC score 


That’s an even better AUC score than for the credit card data set! Let’s take a look 
at what happens when you plot the free energy vs. the probability. As with the previous 
example, let’s take a look at the five-number summary for the normal data to see how the 
distribution looks (Figure 5-62 and Figure 5-63). 


normal data = pd.DataFrame(costs[y test == 0]) 


normal data.describe() 





Figure 5-62. Code to check the five-number summary of the normal data 


The output should look somewhat like Figure 5-63. 


In [22]: normal data = pd.DataFrame(costs[y test == 0]) 
normal data.describe () 


Out [22]: 


0 

count 7000.000000 
mean -43.244312 
std 22784908 
min -46.898159 
25% ji -46.566977 
50% -46.379679 
75%  -45.858756 
max 1145.513062 


Figure 5-63. It seems that the graph is skewed right, and that all of the values are 
under 1150 


209 


CHAPTER 5 BOLTZMANN MACHINES 


Now let’s look at the five-number summary to see what the general distribution of 


the anomalous data looks like (see Figure 5-64 and Figure 5-65). 


anomalies = pd.Datarrame(costs|ly test == 1]) 


anomalies.describe() 





Figure 5-64. Code to check the five-number summary of the anomalous data 


The output should look somewhat like Figure 5-65. 


In [21]: anomalies = pd.DataFrame(costs[y test == 1])) 
anomalies.describe () 


0 

count 4045.000000 
mean 44 125891 
std 100.816040 
min -34.099133 
25%  j-11.051010 
50% -4.358/704 
TS% 89. 738434 
max 14705213851 


Figure 5-65. Based on the maximum value, you don’t need to filter out any values 
for cost, except for what is an anomaly and what is a normal point 


Now you can graph the free energy vs. the probabilities for each value in the test set 


separated by their label. Run the code in Figure 5-66. 


210 


CHAPTER 5 BOLTZMANN MACHINES 


plt.title('Free Energy vs Probabilities for Test Set') 


plt.figure (figsize=(15,10) ) 
plt.xlabel('Free Energy') 
plt.ylabel ('Probabilty') 


PlLshisl(coscely test == 0), bins 100, color='green', 
normed=1.0, label='Normal') 


ple hist icosre ly test == lj, Dinas 100, color='red', normed=1.0, 
label ='Anomaly') 


plt.legend(loc="upper right") 


pLe.show () 





Figure 5-66. Code to plot the free energy vs. the probability for each entry in the 
test set. All of the anomalies have free energies under 1500, so you can filter out all 
values for cost under 1500 to make the graph easier to visualize 


The output should look somewhat like Figure 5-67. 


0.07 
0.06 


0.05 


Probabilty 
a 
S 
‘fim. 


0.03 7 


O02 4 





0.00 


i] 200 ah) bie) Bo) 1000 1200 1400 
Free Energy 


Figure 5-67. There seems to be a defined separation between the anomalies and 
the normal data points. The anomalies in general seem to have a much higher free 
energy cost and a lower-than-usual probability of occurring 


211 


CHAPTER 5 BOLTZMANN MACHINES 


Once again, the RBM has learned the distribution well enough that there’s a clear 


and defined separation between the anomalies and the normal data entries. 


Summary 


In this chapter, we discussed restricted Boltzmann machines and how they can be used 
for anomaly detection. We also explored the application of the RBM to two data sets that 
represented two cases where standardization of the data is necessary for proper training. 
You now know more about what an RBM is, how it works, and how to apply it to different 
data sets. 

In the next chapter, we will take a look at anomaly detection using recurrent neural 
networks. 


212 


CHAPTER 6 


Long Short-Term 
Memory Models 





In this chapter, you will learn about recurrent neural networks and long short-term 
memory models. You will also learn how LSTMs work and how they can be used to 
detect anomalies and how you can implement anomaly detection using LSTM. You 
will work through several datasets depicting time series of different types of data such 
as CPU utilization, taxi demand, etc. to illustrate how to detect anomalies. This chapter 
introduces you to many concepts using LSTM so as to enable you to explore further 
using the Jupyter notebooks provided as part of the book material. 

In a nutshell, the following topics will be covered throughout this chapter: 


e Sequences and time series analysis 
e What is aRNN? 
e Whatis an LSTM? 


e LSTM applications 


Sequences and Time Series Analysis 


Atime series is aseries of data points indexed in time order. Most commonly, a 
time series is a Sequence taken at successive equally spaced points in time. Thus, it is a 
sequence of discrete-time data. Examples of time series are ECG data, weather sensors, 


and stock prices. 


213 
© Sridhar Alla, Suman Kalyan Adari 2019 


S. Alla and S. K. Adari, Beginning Anomaly Detection Using Python-Based Deep Learning, 
https://doi.org/10.1007/978-1-4842-5177-5_6 


CHAPTER 6 LONG SHORT-TERM MEMORY MODELS 


Figure 6-1 shows some examples of time series. 


Time Series of Value by Date Time 





jul Aug Sep Oct Nov Dec 


Date Time 


Figure 6-1. A time series 


Figure 6-2 shows the monthly values of AMO index for last 150 years. 


Monthly Values for the AMO Index, 1856 -2013 


0.6 
0.4 


0.2 


AMO Departure 
oO 





186 1880 1900 1920 1940 1960 1980 2000 


Year 


Figure 6-2. Monthly values of the AMO index 


214 


CHAPTER6 LONG SHORT-TERM MEMORY MODELS 


Figure 6-3 shows a chart of the BP stock price for a 20-year time period. 


BP Stock Price on the NYSE 1979 - 1999 


120 } 


~| it | 
{J 


| ‘ net Adusted Closing Price 
ey \ Na! 
af 4 Ba Wa 
ay 


Price Per Share (USD) 


} A Yd " 
wa ye 
nf 


& 


se > wh < > 
mS > & e r 


or 2 a = ‘ 
<> < ~—\* ~\* ~\* x“ 


Figure 6-3. BP stock price 


Time series analysis refers to the analysis of change in trends of data over a period of 
time. Time series analysis comprises methods for analyzing time series data in order 
to extract meaningful statistics and other characteristics of the data and has a variety of 
applications. One such application is the prediction of the future value of an item based 
on its past values. Future stock price prediction is probably the best example of such 
an application. Another very important use case is the ability to detect anomalies. By 
analyzing and learning the time series in terms of being able to understand the trends 
and changes seen from historical data, we can detect abnormal or anomalous data 


points in the time series. 


Figure 6-4 is a time series with anomalies. It shows the normal data in green and 
possible anomalies in red. 


215 


CHAPTER 6 LONG SHORT-TERM MEMORY MODELS 


Time Series of Value by Date Time 


Observation 





2013-07 2013-08 2013-09 2013-10 M13-11 213-12 2014-01 2014-07 2014-03 2014-04 2014-05 2014-06 
Date Time 


Figure 6-4. Time series with anomalies 


What Is a RNN? 


You have seen several types of neural networks throughout the book so you know that 


the high-level representation of neural networks looks like Figure 6-5. 


pean, —_—ie 


Figure 6-5. A high-level representation of neural networks 





Clearly, the neural network processes input and produces output, and this works on 
many types of input data with varying features. However, a critical piece to notice is that 
this neural network has no notion of the time of the occurrence of the event (input), only 
that input has come in. 

So what happens with events (input) that come in as a stream over long periods of 
time? How can the neural network shown above handle trending in events, seasonality 
in events, etc.? How can it learn from the past and apply it to the present and future? 

Recurrent neural networks try to address this by incrementally building neural 
networks, taking in signals from a previous timestamp into the current network. 

Figure 6-6 shows a RNN. 


216 


CHAPTER 6 LONG SHORT-TERM MEMORY MODELS 


Input T1 Recurrent Neural Network 


Input T2 Recurrent Neural Network 





~ CO 


Input Tn Recurrent Neural Network 





Figure 6-6. A recurrent neural network 


You can see that RNN is a neural network with multiple layers or steps or stages. 
Each stage represents a time T; the RNN at T+1 will consider the RNN at time T as one 
of the signals. Each stage passes its output to the next stage. The hidden state, which is 
passed from one stage to next, is the key for the RNN to work so well and this hidden 
state is analogous to some sort of memory retention. A RNN layer (or stage) acts as an 
encoder as it processes the input sequence and returns its own internal state. This state 
serves as the input of the decoder in the next stage, which is trained to predict the next 
point of the target sequence, given previous points of the target sequence. Specifically, 
it is trained to turn the target sequences into the same sequences but offset by one 
timestep in the future. 

Backpropagation is used when training a RNN as in other neural networks, but 
in RNNs there is also a time dimension. In backpropagation, we take the derivative 
(gradient) of the loss with respect to each of the parameters. Using this information 
(loss), we can then shift the parameters in the opposite direction with a goal to minimize 
the loss. We have a loss at each timestep since we are moving through time and we 
can sum the losses across time to get the loss at each timestep. This is the same as 
summation of gradients across time. 

The problem with the above recurrent neural networks, constructed from regular 
neural network nodes, is that as we try to model dependencies between sequence values 
that are separated by a significant number of other values, the gradients of timestep 


217 


CHAPTER 6 LONG SHORT-TERM MEMORY MODELS 


T depends on gradients at T-1, gradients at T-2, and so on. This leads to the earliest 
eradient’s contribution getting smaller and smaller as we move along the timesteps 
where the chain of gradients gets longer and longer. This is what is known as the 
vanishing gradient problem. This means the gradients of those earlier layers will become 
smaller and smaller and therefore the network won’t learn long-term dependencies. 
RNN becomes biased as a result, only dealing with short-term data points. 

LSTM networks are a way of solving this problem with RNNs. 


What Is an LSTM? 


A LSTM network is a kind of recurrent neural network. As seen above, a recurrent 
neural network is a neural network that attempts to model time or sequence dependent 
behavior, such as language, stock prices, weather sensors, and so on. This is performed 
by feeding back the output of a neural network layer at time T to the input of the 

same network layer at time T+ J. LSTM builds on top of the RNN, adding a memory 
component meant to help propagate the information learned at a time T to the future 
T+1, T+2, and so on. The main idea is that LSTM can forget irrelevant parts of previous 
state while selectively updating state and then outputting certain parts of the state that 
are relevant to the future. 

How does this solve the vanishing gradient problem in RNNs? Well, now we are 
throwing some state, updating some state, and propagating forward some part of the 
state so we no longer have a long chain of backpropagation seen in RNNs. Thus, LSTMs 
are much more efficient than typical RNN. 


218 


CHAPTER 6 LONG SHORT-TERM MEMORY MODELS 


Figure 6-7 is a RNN with tanh activation. 


Recurrent Neural Network 


Input from stage T 


Hidden state from 
stage | 


Hidden state from 
stage T-1 





Figure 6-7. A RNN with tanh activation 
The tanh function is called an activation function. There are several types of 


activation functions that help in applying non-linear transformations on the inputs at 


every node in the neural network. Figure 6-8 shows common activation functions. 


Common Activation Functions 


Sigmoid TanH ReLU 
1.2 15 ‘s 
' 42) = ——~ <a | tanh(z) = ———_ - =e. fz)={° for x <0 
08 i+e* mS tee ihe aoe 
; 
f 








Figure 6-8. Common activation functions 
The key idea behind activation functions is to add non-linearity to the data to align 


better with real-world problems and real-world data. In Figure 6-9, the top graph shows 
linearity and the bottom graph shows nonlinearity. 


219 


CHAPTER 6 LONG SHORT-TERM MEMORY MODELS 





Linearity | 





Figure 6-9. Linear and nonlinear data plots 


Clearly, there is no linear equation to handle the nonlinearity so we need an 
activation function to deal with this property. The different activation functions are listed 
at https://keras.io/activations/. 

In time series data, the data is spread over a period of time, not some instantaneous 
set such as seen in Chapter 4 autoencoders, for example. So not only it is important to look 
at the instantaneous data at some time T, it is also important for older historical data to 
the left of this point to be propagated through the steps in time. Since we need the signals 
from historical data points to survive for a long period of time, we need an activation 
function that can sustain information for a longer range before going to zero. tanh is the 


ideal activation function for the purpose and is graphed as shown in Figure 6-10. 


tanh(a) 





Figure 6-10. tanh activation 


220 


CHAPTER 6 LONG SHORT-TERM MEMORY MODELS 


We also need sigmoid (another activation function) as a way to either remember or 
forget the information. A sigmoid activation function is shown in Figure 6-11. 


1 


0.5 


—6 —4 —2 0 2 ‘4 6 


Figure 6-11. A sigmoid activation function 


Now, conventional RNNs have a tendency to remember everything including 
unnecessary inputs which results in an inability to learn from long sequences. By 
contrast, LSTMs selectively remember important inputs and this allows them to handle 
both short-term and long-term dependencies. 

So how does LSTM do this? It does this by releasing information between the hidden 
state and the cell state using three important gates: the forget gate, the input gate, and 
the output gate. A common LSTM unit is composed of a cell, an input gate, an output 
gate, anda forget gate. The cell remembers values over arbitrary time intervals and 
the three gates regulate the flow of information into and out of the cell. 

A more detailed LSTM architecture is shown in Figure 6-12. There are a couple of key 
functions used, the tanh and the sigmoid, which are activation functions. F, is the forget 
gate, I, is the input gate, and O, is the output gate. 


221 


CHAPTER 6 LONG SHORT-TERM MEMORY MODELS 





LSTM Unit 


Figure 6-12. A detailed LSTM network 
Source: commons. wikimedia.org 


A forget gate is the first part of the LSTM stage and pretty much decides how much 
information from a prior stage should be remembered or forgotten. This is accomplished 
by passing the previous hidden state hT-1 and current input xT through a sigmoid 
function. 

The input gate helps decide how much information to pass to current stage by using 
the sigmoid function and also a tanh function. 

The output gate controls how much information will be retained by the hidden state 
of this stage and passed onto the next stage. Again, the current state passes through the 
tanh function. 

Just for information, the compact forms of the equations for the forward pass of an 
LSTM unit with a forget gate are (source : Wikipedia) 

f,=°o, (W,x, +U ,h,_, + b,) 

i, =o, (W,x,+U,h,, +b, ) 
0,=0,(W,x,+U,h,, +8, ) 

c,= f,°¢,_,+i,00,(W,x,+U.h,, +b, ) 


h, — 0, 00;,(c, ) 


where the initial values are cy = 0 and hy = 0, and the operator ° denotes the element-wise 
product. The subscript indexes the time step. 


222 


CHAPTER 6 LONG SHORT-TERM MEMORY MODELS 


Variables 


xt € Rd {\displaystyle x_{t}\in \mathbb {R} “{d}! x, € IR*: Input vector 
to the LSTM unit 

ft € Rh {\displaystyle f_{t}\in \mathbb {R} “{h}} f, € IR”: Forget gate’s 
activation vector 

it € Rh {\displaystyle i_{t}\in \mathbb {R} “{h}} i, € IR”: Input/update 
gate’s activation vector 

o t€ Rh {\displaystyle o_{t}\in \mathbb {R} “{h}} 0, € R’: Output 
gate’s activation vector 

ht€ Rh {\displaystyle h_{t}\in \mathbb {R} “{h}} h, € R’: Hidden 
state vector, also known as the output vector of the LSTM unit 

ct € Rh {\displaystyle c_{t}\in \mathbb {R} “{h}} c, € R”: Cell state 
vector 

WeERh~-x d {\displaystyle W\in \mathbb {R} “{h\times d}! We R"™4, 
Ue R"*" and b € R"U € Rh xh {\displaystyle U\in \mathbb {R} 
Ath\times h}} b € Rh {\displaystyle b\in \mathbb {R} “{h}} : Weight 
matrices and bias vector parameters, which need to be learned 


during training 


The superscripts refer to the number of input features and number of hidden units, 


respectively. 


oO, : sigmoid function 
o.,: hyperbolic tangent function 
o,, : hyperbolic tangentfunction 


LSTM for Anomaly Detection 


In this section, you will look at LSTM implementations for some use cases using time 


series data as examples. You have few different time series datasets to use to try to detect 


anomalies using LSTM. All of them have a timestamp and a value that can easily be 
plotted in Python. 


223 


CHAPTER 6 LONG SHORT-TERM MEMORY MODELS 


Figure 6-13 shows the basic code to import all necessary packages. Also note the 


versions of the various necessary packages. 


import keras 

from keras import optimizers 

from keras import losses 

from keras.models import Sequential, Model 

from keras. layers import Dense, Input, Dropout, Embedding, LSTM 
from keras.optimizers import RMSprop, Adam, Nadam 

from keras preprocessing import sequence 

from keras.callbacks import Tensorfoard 


import sklearn 

from sklearn.preprocessing import Standardscaler 

from sklearn.model_selection import train_test_split 

from sklearn.metrics import confusion_matrix, roc_auc_score 
from sklearn.preprocessing import MinMaxScaler 


import seaborn as sns 
import pandas as pd 
import numpy as np 
import matplotlib 


import matplotlib.pyplot as plt 
import matplotlib.gridspec as gridspec 
amatplotlib inline 


import tensorflow 
import sys 
print("Python: ", sys.version) 


print("pandas: ", pd._ version_) 
print("numpy: “, np.__version__) 
print("seaborn: ", sns.__version_) 
print("matplotlib: ", matplotlib._version_) 
print("sklearn: ", sklearn._version_) 
print("Keras: ", keras.__version_) 
print("Tensorflow: ", tensorflow.__version_) 


Using TensorFlow backend. 


Python: 3.7.1 (default, Dec 10 2018, 22:54:23) [MSC v.1915 64 bit (AMDéE4)] 
pandas: 8.24.2 

numpy: 1.16.3 

seaborn: 6.9.8 

matplotlib: 3. 
Sklearn: #@8.20. 
Keras: 2.2.4 
Tensorflow: 1.13.1 


@.3 
3 


Figure 6-13. Code to import packages 


Figure 6-14 shows the code to visualize the results via a chart for the anomalies and a 


chart for the errors (the difference between predicted and truth) while training. 


224 


CHAPTER6 LONG SHORT-TERM MEMORY MODELS 


Class Visualization: 
labels = ["“Normal", “Anomaly"™] 


def draw_anomaly(self, y, error, threshold): 
groupsDF = pd.DataFrame({'error’: error, 
‘true’: y}).groupby( ‘true’ ) 


figure, axes = plt.subplots(figsize=(12, 8)) 


for name, group in groupsDF: 
axes.plot(group.index, group.error, marker='x' if name == 1 else ‘o', linestyle="', 
color='r" if name == 1 else ‘'g’, label="Anomaly” if name == 1 else “Normal”) 


axes.hlines(threshold, axes.get_xlim()[@], axes.get_xlim()[1], colors="b", zorder=10@, label=" 
axes. legend() 


plt.title("Anomalies”) 
plt.ylabel("Error”) 
plt.xlabel("Data”) 
plt.show() 


def draw_error(self, error, threshold): 
plt.figure(figsize=(190, 8)) 
plt.plot(error, marker='0', ms=3.5, linestyle="', 
label="Point' ) 


plt.hlines(threshold, xmin=@, xmax=len(error)-1, colors="r", zorder=100, label='Threshold’) 
plt.legend() 

plt.title("Reconstruction error”) 

plt.ylabel("Error™) 

plt.xlabel("“Data”) 

plt.show() 


Figure 6-14. Code to visualize errors and anomalies 


You will use different examples of time series data to detect whether a point is 
normal/expected or abnormal/anomaly. Figure 6-15 shows the data being loaded into a 
Pandas dataframe. It shows a list of paths to datasets. 


dataFilePaths = [‘data/art_daily_no_noise.csv', 
‘data/art_daily_nojump.csv', 
‘data/art_daily_jumpsdown.csv', 
‘data/art_daily perfect_square wave.csv', 
‘data/art_increase spike density.csv', 
‘data/art_load_ balancer_spikes.csv', 
‘data/ambient_temperature_system_failure.csv’, 
‘data/nyc_taxi.csv', 
‘data/ec2_cpu_utilization.csv', 
‘data/rds_cpu_utilization.csv' ] 


Figure 6-15. A list of paths to datasets 


2290 


CHAPTER 6 


LONG SHORT-TERM MEMORY MODELS 


You will work with one of the datasets in more detail now. The dataset is nyc_taxi, 


which basically consists of timestamps and demand for taxis. This dataset shows the 
NYC taxi demand from 2014-07-01 to 2015-01-31 with an observation every half hour. 
There are few detectable anomalies in this dataset: Thanksgiving, Christmas, New Year’s 


Day, a snow storm, etc. 


Figure 6-16 shows the code to select the dataset. 


i=7 


tensorlog = tensorlogs[i] 
dataFilePath = dataFilePaths[i] 
print("tensorlog: ", tensorlog) 


print("“dataFilePath: 


tensorlog: 
dataFilePath: 


nyc_taxi 
data/nyc_taxi.csv 


", dataFilePath) 


Figure 6-16. Code to select the dataset 


You can load the data form the dataFilePath as a csv file using Pandas. Figure 6-17 


shows the code to read the csv datafile into Pandas. 


df = pd.read_csv(filepath_or_buffer=dataFilePath, header=0, sep=",") 
print('Shape:' , df.shape[@)) 

print( ‘Head: ") 
print(df.head(5)) 


Shape: 16320 
Head: 


5) 
1 
Z 
3 
4 


timestamp 
2014-87-01 68:00:00 
2014-87-81 68:30:00 
2014-87-01 61:00:00 
2014-87-61 61:30:96 
2014-67-01 62:00:00 


value 
16844 
6127 
62108 
4656 
38208 


Figure 6-17. Code to read a csv datafile into Pandas 


Figure 6-18 shows the plotting of the time series showing the months on the 


x-axis and the value on the y-axis. It also shows the code to generate a graph showing 


the time series. 


226 


CHAPTER 6 LONG SHORT-TERM MEMORY MODELS 


df['Datetime'] = pd.to_datetime(df[ *timestamp']) 
print(df.head(3)) 

df.shape 

df.plot(x="Datetime’, y="value’, figsize=(12,6)) 
plt.xlabel( ‘Date time’) 

plt.ylabel( ‘Value’ ) 

plt.title('Time Series of value by date time’) 


timestamp value Datetime 
@ 2014-07-01 00:00:00 10844 2014-07-01 90:00:00 
1 2014-07-01 00:30:00 8127 2014-07-01 90:30:00 
2 2014-07-01 01:00:00 6210 2014-07-01 01:00:00 


Text(@.5, 1.0, ‘Time Series of value by date time’) 


Time Series of Value by Date Time 





jan 
2015 


Date Time 


Figure 6-18. Plotting the time series 


Let’s understand the data more. You can run the describe() command to look at the 
value column. Figure 6-19 shows the code to describe the value column. 


df.value.describe() 


count 10320 .8@0000 
mean 15137.569380 


std 6939.495808 
min 8.800000 
25% 10262 .8000020 
50% 16778 .8Q0000 
75% 19838.750000 
max 39197 .2e0000 


Name: value, dtype: float64 


Figure 6-19. Describing the value column 


220 


CHAPTER 6 LONG SHORT-TERM MEMORY MODELS 


You can also plot the data using seaborn kde plot, as shown in Figure 6-20. 


fig, (axl) = plt.subplots(ncols=1, figsize=(3, 5)) 
axl.set_title( "Before Scaling’) 
sns.kdeplot(df[‘value"], ax=ax1) 
<matplotlib.axes. subplots.AxesSubplot at @x2ba2b@b3bas> 
Before Scaling 
0.00008 
0.00007 
0.00006 
0.00005 
0.00004 
0.00003 
0.00002 


0.00001 








0.00000 


o 10000 20000 30000 40000 


Figure 6-20. Using kde to plot the value column 


The data points have a minimum of 8 and maximum of 39197, which is a wide range. 
You can use scaling to normalize the data. 

The formula for scaling is (x-Min) / (Max-Min). Figure 6-21 shows the code to scale 
the data. 


from sklearn.preprocessing import MinMaxScaler 

scaler = MinMaxScaler(feature_range = (@, 1)) 

df['scaled_value']) = pd.DataFrame(scaler.fit_transform(pd.DataFrame(df[ ‘value’ ))),columns=[ 'value‘]) 
print('Shape:" , df.shape[@]) 

df .head(5) 


Shape: 16320 


timestamp value Datetime scaled_value 
0 2014-07-01 00:00:00 10844 2014-07-01 00:00:00 0.276506 
1 2014-07-01 00:30:00 8127 2014-07-01 00:30:00 O.207175 


2 2014-07-0101:00:00 6210 2014-07-01 01:00:00 0.158259 
3 2014-07-01 01:30:00 4656 2014-07-01 01:30:00 0.118605 
4 2014-07-01 02:00:00 2820 2014-07-01 02:00:00 0.007272 


Figure 6-21. Code to scale the data 
228 


CHAPTER 6 


LONG SHORT-TERM MEMORY MODELS 


Now that you scaled the data, you can plot the data again. You can plot the data using 


seaborn kde plot, as shown in Figure 6-22. 


fig, (axl) = plt.subplots(ncols=1, figsize=(3, 5)) 


axl.set_title('After Scaling’) 
sns.kdeplot(df[‘scaled_value’'], ax=ax1) 


<matplotlib.axes. subplots.AxesSubplot at @x2ba2c@a755@> 





Figure 6-22. Using kde to plot the scaled_value column 


You can take a look at the dataframe now that you have scaled the value column. 
Figure 6-23 shows the dataframe showing the timestamp and value as well as scaled_ 


After Scaling 


—— Scaled value 








00 O72 


value and the datetime. 


df 


& Ww AW 


-head(5) 


timestamp 
2014-07-01 00:00:00 
2014-07-01 00:30:00 
2014-07-01 01:00:00 
2014-07-01 01:30:00 
2014-07-01 02:00:00 


- 


4 


value 
10844 
8127 
6210 
4656 
3820 


== 


06 8 


10 


Datetime scaled_value 


2014-07-01 00:00:00 
2014-07-01 00:30:00 
2014-07-01 01:00:00 
2014-07-01 01:30:00 
2014-07-01 02:00:00 


Figure 6-23. The modified dataframe 


0.276506 
0.207175 
0.158259 
0.115605 
0.087272 


229 


CHAPTER 6 LONG SHORT-TERM MEMORY MODELS 


There are 10320 data points in the sequence and your goal is to find anomalies. This 
means you are trying to find out when data points are abnormal. If you can predict a 
data point at time T based on the historical data until T-1, then you have a way of looking 
at an expected value compared to an actual value to see if you are within the expected 
range of values for time T. If you predicted that ypred number of taxis are in demand on 
January 1, 2015, then you can compare this ypred with the actual yactual. The difference 
between ypred and yactual gives the error, and when you get the errors of all the points 
in the sequence, you end up with a distribution of just errors. 

To accomplish this, you will use a sequential model using Keras. The model consists 
of a LSTM layer and a dense layer. The LSTM layer takes as input the time series data and 
learns how to learn the values with respect to time. The next layer is the dense layer (fully 
connected layer). The dense layer takes as input the output from the LSTM layer, and 
transforms it into a fully connected manner. Then, you apply a sigmoid activation on the 
dense layer so that the final output is between 0 and 1. 

You also use the adam optimizer and the mean squared error as the loss function. 
Figure 6-24 shows the code to build a LSTM model. 


time_steps = 438 
metric = ‘mean_absolute_error' 


model = Sequential() 
model.add(LSTM(units=32, activation="tanh', input_shape=(time_steps, 1), return_sequences=True)) 


model.add(Dense(1, activation="sigmoid')) 


model.compile(optimizer="adam', loss="mean_absolute_error', metrics=[metric]) 
print(model.summary()) 


Layer (type) Output Shape Param # 
istm_5 (LSTM) (None, 48, 32) 4352 
dense_5 (Dense) (None, 46, 1) 33 


EE ZEUS EEE EEE TEESE EEE IEEE EEE TEE EEE ETE EEE 
Total params: 4,385 

Trainable params: 4,385 

Non-trainable params: @ 


None 


Figure 6-24. Code to build a LSTM model 


230 


CHAPTER 6 LONG SHORT-TERM MEMORY MODELS 


As shown above, you used a LSTM layer. Let’s look at the details of the LSTM 
layer function with all the possible parameters (Source: https: //keras.io/layers/ 


recurrent/): 


keras.layers.LSTM(units, activation=‘tanh; recurrent_ 
activation="hard_sigmoid, use_bias=True, kernel_ 
initializer=‘glorot_uniform, recurrent_initializer=‘orthogonal, 
bias_initializer="zeros, unit_forget_bias=True, 
kernel_regularizer=None, recurrent_regularizer=None, 
bias_regularizer=None, activity_regularizer=None, 
kernel_constraint=None, recurrent_constraint=None, 
bias_constraint=None, dropout=0.0, recurrent_dropout=0.0, 
implementation=1, return_sequences=False, return_state=False, 
go_backwards=False, stateful=False, unroll=False) 


Arguments 


units: Positive integer, dimensionality of the output space 


activation: Activation function to use (see https: //keras.io/ 
activations). Default: hyperbolic tangent (tanh). If you pass None, 
no activation is applied (i.e. “linear” activation: a(x) = xX). 


recurrent_activation: Activation function to use for the recurrent 
step (see https: //keras.io/activations). Default: hard sigmoid 
(hard sigmoid). If you pass None, no activation is applied (ie. “linear” 
activation: a(x) = x). 


use_bias: Boolean, whether the layer uses a bias vector 


kernel_initializer: Initializer for the kernel weights matrix, used 
for the linear transformation of the inputs (see https: //keras.io/ 
initializers) 


recurrent_initializer: Initializer for the recurrent_kernel weights 
matrix, used for the linear transformation of the recurrent state 
(see https: //keras.io/initializers). 


bias_initializer: Initializer for the bias vector (see https: //keras. 
10/initializers) 


231 


CHAPTER 6 LONG SHORT-TERM MEMORY MODELS 


e unit_forget_bias: Boolean. If True, add 1 to the bias of the 
forget gate at initialization. Setting it to true will also force bias _ 
initializer="zeros". This is recommended in Jozefowicz et al. 
(2015). 


e kernel_regularizer: Regularizer function applied to the kernel 
weights matrix (see https: //keras.io/regularizer) 


e recurrent_regularizer: Regularizer function applied to the 
recurrent_kernel weights matrix (see https: //keras.io/ 
regularizer) 


e bias_regularizer: Regularizer function applied to the bias vector 
(see https: //keras.io/regularizer) 


e activity_regularizer: Regularizer function applied to the output of 
the layer (its “activation”) (see https://keras.io/regularizer) 


e kernel_constraint: Constraint function applied to the kernel 
weights matrix (see https: //keras.io/constraints) 


e recurrent_constraint: Constraint function applied to the recurrent_ 
kernel weights matrix (see https: //keras.io/constraints) 


e bias_constraint: Constraint function applied to the bias vector (see 
https://keras.io/constraints) 


e dropout: Float between 0 and 1. Fraction of the units to drop for the 


linear transformation of the inputs. 


e recurrent_dropout: Float between 0 and 1. Fraction of the units to 
drop for the linear transformation of the recurrent state. 


e implementation: Implementation mode, either 1 or 2. Mode 1 will 
structure its operations as a larger number of smaller dot products 
and additions, whereas mode 2 will batch them into fewer, larger 
operations. These modes will have different performance profiles on 
different hardware and for different applications. 


e return_sequences: Boolean. Whether to return the last output in the 
output sequence, or the full sequence. 


232 


CHAPTER 6 LONG SHORT-TERM MEMORY MODELS 


e return_state: Boolean. Whether to return the last state in addition 
to the output. The returned elements of the state’s list are the hidden 
state and the cell state, respectively. 


e go_backwards: Boolean (default False). If True, process the input 
sequence backwards and return the reversed sequence. 


e stateful: Boolean (default False). If True, the last state for each 
sample at index iin a batch will be used as the initial state for the 
sample of index i in the following batch. 


e unroll: Boolean (default False). If True, the network will be unrolled, 
else a symbolic loop will be used. Unrolling can speed up a RNN, 
although it tends to be more memory-intensive. Unrolling is only 
suitable for short sequences. 


If you notice the LSTM call in the above code snippet, there is a parameter time_ 
steps=48 being used. This is the number of steps in the sequence that is used in training 
LSTM. 48 clearly means 24 hours, since your data points are 30 minutes apart. You can 
try changing this to 64 or 128 and see what happens to the output. 

Figure 6-25 shows the code to split the sequence into a tumbling window of 
sub-sequences of length 48. Note the shape of sequence_trimmed, which is 215 
subsequences of 48 points each with 1 dimension at each point (clearly you only have 
scaled_value as a column at each time stamp). 


sequence = np.array(df['scaled_value'])) 
print(sequence) 

time_steps = 48 

samples = len(sequence) 

trim = samples % time_steps 

subsequences = int(samples/time_steps) 
sequence_trimmed = sequence[:samples - trim] 


print(samples, subsequences) 
sequence _trimmed.shape = (subsequences, time_steps, 1) 
print(sequence_trimmed. shape) 


(@.27650616 @.20717548 @.1582587 ... @.69664957 @.6783281 @.67059634) 
10328 215 
(215, 48, 1) 


Figure 6-25. Code to create subsequences 


233 


CHAPTER 6 LONG SHORT-TERM MEMORY MODELS 


Now, let’s train your model for 20 epochs, using the training set as the validation 


data. You can do so as follows. Figure 6-26 shows the code to train the model. 


training_dataset = sequence_trimmed 
print("training_dataset: ", training_dataset. shape) 


batch_size=32 
epochs=20 


model.fit(x=training_dataset, y=training_dataset, 


batch_size=batch_size, epochs=epochs, 
verbose=1, validation_data=(training_dataset, training_dataset), 
callbacks=[(TensorBoard(log_dir='../logs/{®}".format(tensorlog)))) 


training_dataset: (215, 48, 1) 

Train on 215 samples, validate on 215 samples 

Epoch 1/20 

215/215 [===22ssesesssssssssssrssssresss)] - 15 6ms/step 
val_loss: @.0377 - val_mean_absolute_error: @.0377 
Epoch 2/20 

215/215 [ se eeesessesse sees sees |} = 1s Sms /step 
val_loss: @.037@ - val_mean_absolute_error: 9.0370 
Epoch 3/20 

215/215 [SSeeeeesseseseesesssesssseeses | - 1s Sms/step 
val_loss: 0.0361 - val_mean_absolute_error: 0.0361 
Epoch 4/20 

215/215 [SSSeeeesssssssssessssssssssses] - is Sms/step 
val_loss: @.0354 - val_mean_absolute_error: @.@354 
Epoch 5/20 

215/215 [seeceeesesssssessesssesseseses)] - 15 Sms/step 
val_loss: 0.0346 - val_mean_absolute_error: 0.0346 
Epoch 6/20 

215/215 [seeceeeseessessseeeseeseeseses)] - 15 Sms/step 
val_loss: 0.0339 - val_mean_absolute_error: 6.0339 
Epoch 7/20 

215/215 [===sseesesssssssesssesssssss=e2] - 15 5ms/step 
val_loss: 0.0332 - val_mean_absolute_error: 0.0332 
Epoch 8/20 

215/215 (SSS Sessssssssesesssessssess==] - 15 Sms/step 
val_loss: 0.0326 - val_mean_absolute_error: 6.0326 
Epoch 9/20 

215/215 [222ssessesseeseseseseseseseees)] - 15 Sms/step 
val_loss: 0.0319 - val_mean_absolute_error: @.@319 
Epoch 10/206 

215/215 [SS28eeesssssseeressssssssesees] - is Sms/step 
val_loss: 0.0315 - val_mean_absolute_error: ©.0315 
Epoch 11/20 

215/215 [SsS2SSssssssssseeessesssesee==) - 1s Sms/step 
val_loss: 6.0309 - val_mean_absolute_error: 6.0309 
Epoch 12/26 

215/215 [==2sseeeessssssesessrsssssssss] - 15 Sms/step 
val_loss: 6.0299 - val_mean_absolute_error: 6.6299 
Epoch 13/20 

215/215 [seceseeessseeeeeessseseesesees)] - 15 Sms/step 
val_loss: 0.0302 - val_mean_absolute_error: ©.@302 
Epoch 14/20 

215/215 [se eeeeeeseese esses eee |) - ls Sms/step 
val_loss: 6.0269 - val_mean_absolute_error: @.@289 


Figure 6-26. Code to train the model 


234 


loss: 


- 1055: 


105s: 


loss: 


1055: 


loss: 


loss: 


loss: 


loss: 


1055: 


1055: 


loss: 


loss: 


loss: 


8.8382 


8.0376 


0.0367 


9.8356 


8.8351 


9.0344 


9.0343 


@.86331 


8.0324 


8.03186 


8.0311 


8.8305 


8.8300 


8.8293 


mean_absolute_error: 


mean_absolute_error: 


mean_absolute_error: 


mean_absolute_error: 


mean_absolute_error: 


mean_absolute_error: 


mean_absolute_error: 


mean_absolute_error: 


mean_absolute_error: 


mean_absolute_error: 


mean_absolute_error: 


mean_absolute_error: 


mean_absolute_error: 


mean_absolute_error: 


@.0362 


@.0376 


0.0367 


@.0353 


@.0351 


@.0344 


@.0343 


@.0331 


@.0324 


@.0315 


@.0311 


@.0305 


@.0300 


@.0295 


CHAPTER 6 


Epoch 15/20 

215/215 [===s=s==ssssesssesesee=eee======] - 1s Sms/step 
val_loss: 8.0280 - val_mean_absolute_error: @.0280 
Epoch 16/20 

215/215 [seeeeeeeesesesessesesssssseses | - 1s Sms/step 
val_loss: @.0272 - val_mean_absolute_error: @.0272 
Epoch 17/20 

215/215 [======s=ssssssesseeeeee=e=e====] - 15 Sms/step 
val_loss: @.0265 - val_mean_absolute_error: @.0265 
Epoch 18/20 

215/215 [sssesesessesessssssssssssssss=] - 1s Sms/step 
val_loss: @.0261 - val_mean_absolute_error: @.0261 
Epoch 19/20 

215/215 [sseseseseeeesesesesssssessssss=] - 1s Sms/step 
val_loss: 0.0254 - val_mean_absolute_error: 9.0254 
Epoch 20/20 

215/215 [====sssss=====================] - 1s 6ms/step 
val_loss: @.0248 - val_mean_absolute_error: @.0248 


Figure 26. (continued) 


loss: 


loss: 


loss: 


loss: 


loss: 


loss: 


8.8286 


8.0278 


8.0270 


8.0265 


8.8260 


8.0251 


mean_absolute_error: 


mean_absolute_error: 


mean_absolute_error: 


mean_absolute_error: 


mean_absolute_error: 


mean_absolute_error: 


@.0286 


@.0278 


@.0278 


@.0265 


@.0268 


@.0251 


Figure 6-27 shows the plotting of the loss during the training process through the 


epochs of training. 


loss 


loss 


0.038 
0.036 | 
0.034 
0.032 | 
0.03 
0.028 + 


0.026 


Figure 6-27. Graph of loss in TensorBoard 





LONG SHORT-TERM MEMORY MODELS 


Figure 6-28 shows the plotting of the mean absolute error during the training process 


through the epochs of training. 


235 


CHAPTER 6 LONG SHORT-TERM MEMORY MODELS 


mean_absolute_error 


Figure 6-28. Graph of mean absolute error in TensorBoard 


Figure 6-29 shows the plotting of the loss of validation during the training process 
through the epochs of training. 


val_loss 


0.042 

0.04 
0.038 
0.036 
0.034 
0.032 

0.03 
0.028 
0.026 
0.024 


0.022 


5 0 5 10 15 20 25 


Figure 6-29. Graph of loss of validation in TensorBoard 


236 


CHAPTER 6 LONG SHORT-TERM MEMORY MODELS 


Figure 6-30 shows the plotting of the mean absolute error of validation during the 
training process through the epochs of training. 


val_mean_absolute_error 


0.042 


0.028 

0.026 

0.024 

0.022 

0.02 
5 0 > 10 15 20 25 


Figure 6-30. Graph of mean absolute error of validation in TensorBoard 


Figure 6-31 shows the graph of the model as visualized by TensorBoard. 


237 


CHAPTER 6 LONG SHORT-TERM MEMORY MODELS 





Figure 6-31. Graph of the model as visualized by TensorBoard 


Once the model is trained, you can predict a test dataset that is split into 
subsequences of the same length (time_steps) as the training datasets. Once this is done, 
you can then compute the root mean square error (RMSE). 


238 


CHAPTER 6 LONG SHORT-TERM MEMORY MODELS 


Figure 6-32 shows the code to predict on the testing dataset. 


import math 


from sklearn.metrics import mean_squared_error 


sequence = np.array(df[‘'scaled_value']) 
print(sequence) 

time_steps = 48 

samples = len(sequence) 

trim = samples % time_steps 

subsequences = int(samples/time_steps) 
sequence_trimmed = sequence[:samples - trim] 


print(samples, subsequences) 
sequence_trimmed.shape = (subsequences, time_steps, 1) 
print(sequence_trimmed. shape) 


testing dataset = sequence trimmed 
print("testing_ dataset: ", testing _dataset.shape) 


testing pred = model.predict(x=testing dataset) 
print("“testing_ pred: ", testing _pred.shape) 


testing dataset = testing _dataset.reshape((testing_dataset.shape[@]*testing dataset.shape[1]), testing. 
print("testing dataset: ", testing dataset.shape) 


testing pred = testing_pred.reshape((testing_pred.shape[@]*testing_pred.shape[1]), testing_pred.shape[: 
print("testing_pred: “, testing _pred.shape) 

errorsDOF = testing dataset - testing_pred 

print(errorsDF .shape) 

rmse = math.saqrt(mean_squared_error(testing_dataset, testing_pred)) 

print('Test RMSE: %.3f° % rmse) 


[@.27650616 @.20717548 @.1582587 ... @.69664957 @.6783281 @.67059634] 
10320 215 

(2725.° 46;: 3) 

testing dataset: (215, 48, 1) 

testing pred: (215, 48, 1) 

testing _dataset: (10320, 1) 

testing_pred: (10320, 1) 

(10320, 1) 

Test RMSE: 8.040 


Figure 6-32. Code to predict on the testing dataset 


RMSE is 0.040, which is quite low, and this is also evident from the low loss from 

the training phase after 20 epochs: loss: 0.0251 - mean_absolute_error: 0.0251 - 
val_loss: 0.0248 - val_mean_absolute_error: 0.0248 

Now you can use the predicted dataset and the test dataset to compute the difference 
as diff, which is then passed through vector norms. Calculating the length or magnitude 
of vectors is often required directly as a regularization method in machine learning. 
Then you can sort the scores/diffs and use a cutoff value to pick the threshold. This 
obviously can change as per the parameters you choose, particularly the cutoff value 
(which is 0.99 in Figure 6-33). The figure also shows the code to compute the threshold. 


239 


CHAPTER 6 LONG SHORT-TERM MEMORY MODELS 


#based on cutoff after sorting errors 
dist = np.linalg.norm(testing_ dataset - testing pred, axis=-1) 


scores =dist.copy() 
print(scores.shape) 

scores.sort() 

cutoff = int(@.999 * len(scores)) 
print(cutoff) 
#print(scores[cutoff: }) 
threshold= scores[cutoff] 
print(threshold) 


(1032, ) 


10309 
@.3330642728290365 


Figure 6-33. Code to compute the threshold 


You got 0.333 as the threshold; anything above is considered an anomaly. 
Figure 6-34 shows the code to plot testing dataset (GREEN) and the corresponding 
predicted dataset (RED). 


plt.figure(figsize=(24,16)) 
plt.plot(testing_dataset, color='green") 
plt.plot(testing pred, color='red') 


[<matplotlib.lines.Line2D at @x2bc@82169e3>] 





| 


nM nt | tT f \ | i) iy | i) 





Hl | | HAA A ll | | 
| mm | i Ll Wn in | | 


Figure 6-34. Plotting the testing and predicted datasets 


240 


CHAPTER 6 LONG SHORT-TERM MEMORY MODELS 


Figure 6-35 shows the code to classify a datapoint as anomaly or normal. 


#Label the records anomalies or not based on threshold 
Zz = zip(dist >= threshold, dist) 


y_label=[] 
error = [] 
for idx, (is_anomaly, dist) in enumerate(z): 
if is_anomaly: 
y_label.append(1) 
else: 
y_label.append(@) 
error.append(dist) 


Figure 6-35. Code to classify a datapoint as anomaly or normal 


Figure 6-36 shows the code to plot the data points with respect to the threshold. 


viz = Visualization() 
viz.draw_anomaly(y_label, error, threshold) 


Anomalies 





Figure 6-36. Code to plot the data points with respect to the threshold 


241 


CHAPTER 6 LONG SHORT-TERM MEMORY MODELS 


Figure 6-37 shows the code to append the anomaly flag to the dataframe. 


adf = pd.DataFrame({‘Datetime’: df["Datetime’], ‘observation’: df[‘value'], 
‘error’: error, ‘anomaly’: y_label}) 
adf.head(5) 


Datetime observation error anomaly 


0 2014-07-01 00:00:00 10844 0.150302 0 


1 2014-07-01 00:30:00 $127 0.147602 0 
2 2014-07-01 01:00:00 6210 0.109466 0 
3 2014-07-01 01:30:00 4656 0.063570 0 
4 2014-07-01 02:00:00 3820 0.019833 0 


Figure 6-37. Code to append the anomaly flag to the dataframe 


Figure 6-38 shows the code to generate a graph showing the anomalies. 


figure, axes = plt.subplots(figsize=(12, 6)) 

axes.plot(adf['Datetime’], adf[*observation'], color='g') 

anomaliesDF = adf.query(‘anomaly == 1°) 

axes. scatter (anomaliesDF['Datetime'].values, anomaliesDF['observation'], color="r'") 
plt.xlabel( "Date time’) 

plt.ylabel( ‘observation’ ) 

plt.title('Time Series of value by date time") 


Text(@.5, 1.0, ‘Time Series of value by date time") 


Time Senes of Value by Date Time 





2014-07 2014.08 2014.09 2014-10 2014-11 2014-12 2015-01 2015-02 
Cate Tene 


Figure 6-38. A graph showing anomalies 


242 


CHAPTER 6 LONG SHORT-TERM MEMORY MODELS 


In above graph you can spot an anomaly around Thanksgiving Day, one around New 
Year Eve, and another one possibly on a snow storm day in January. 

If you play around with some of the parameters you used, such as number of time_ 
steps, threshold cutoffs, epochs of the neural network, batch size, and hidden layer, you 
will see different results. 

A good way to improve the detection is to curate good normal data, use identified 
anomalies, and put it in the mix to have a way to tune the parameters until you get good 


matches on the identified anomalies. 


Examples of Time Series 
art_daily_no_noise 


This data set has no noise or anomalies and is a normal time series dataset. As you can 
see below, the time series has values at different timestamps. 

Dataset: art_daily_no_noise.csv 

Figure 6-39 shows the code to generate a graph showing the time series. 


Time Series of Value by Date Time 


10 | 


: 


06 4 



































o . 
= — Value 
= 
o44 
02 4 
00 4 
Ol 02 o3 o 6 6 or eB o 10 rr YF. Ly 4 
Apr 
2014 
Date Time 


Figure 6-39. A graph showing the time series 


243 


CHAPTER 6 LONG SHORT-TERM MEMORY MODELS 


Using visualization, you can plot the new time series now. As shown below, the time 
series shows the datatime vs. the value column. Since there are no anomalies, everything 


is green. Figure 6-40 shows code to generate a graph showing anomalies. 


figure, axes = plt.subplots(figsize=(12, 6)) 

axes.plot(adf[‘Datetime’], adf[‘observation'], color='g') 

anomaliesDF = adf.query('anomaly == 1") 

axes .scatter(anomaliesOF[ 'Datetime'].values, anomaliesDF["observation'], color='"r") 
plt.xlabel('Date time’) 

plt.ylabel(‘observation') 

plt.title( "Time Series of value by date time’) 


Text(@.5, 1.0, "Time Series of value by date time’) 


Time Senes of Value by Date Time 


os; ) | 


a 
” 


Lsersathon 


oa 
de 


Oz 4 


o0 4 


u14 4 Ol 14-44.03 14 OM is 14.04 of 201 4-04..09 41 4-4.11] 14.04 13 214.04 15 
Dace Time 


Figure 6-40. A graph showing anomalies 


Since this data set has no noise or anomalies and is a normal time series dataset, 
there are no anomalies (datapoints in RED) shown and everything is green. 

Next, let’s examine another dataset which is different from the current dataset. You 
will build a LSTM model and see if there are anomalies or not. 


art_daily_nojump 


This data set has no noise or anomalies and is a normal time series dataset. As you can 
see below, the time series has values at different timestamps. 

Using visualization, you can plot the time series now. You convert the timestamp to 
datetime for this work and also drop the timestamp column. As shown below, the time 
series shows the datatime vs. the value column. 

Dataset: art_daily_nojump.csv 


244 


CHAPTER 6 LONG SHORT-TERM MEMORY MODELS 


Figure 6-41 shows the code to generate a graph showing the time series. 


Time Series of Value by Date Time 








08 ; 


06 


o4 


O02 


Date Time 


Figure 6-41. A graph showing the time series 


Let’s add the anomaly column to the original dataframe and prepare a new 
dataframe. Using visualization, you can plot the new time series now. As shown below, 
the time series shows the datatime vs. the value column. Since there are no anomalies, 


everything is green. Figure 6-42 shows the code to generate a graph showing anomalies. 


245 


CHAPTER 6 LONG SHORT-TERM MEMORY MODELS 


figure, axes = plt.subplots(figsize=(12, 6)) 

axes.plot(adf["Datetime'], adf['observation'], color='g') 

anomaliesDF = adf.query('anomaly == 1") 

axes. scatter(anomaliesDF['Datetime'], anomaliesDF['observation'], color='r') 
plt.xlabel( ‘Date time’) 

plt.ylabel( ‘observation’ ) 

plt.title('Time Series of value by date time") 


Text(@.5, 1.0, “Time Series of value by date time") 


Time Series of Value by Date Time 


O68 


O46 


Observation 
[= J 
& 


O2 





oo 


—— eee eee 
2014-04-01 2014-04-03 2014-04-05 2014-04-07 2014-04-09 2014-04-11 201444-13 2014-04-15 


Date Time 


Figure 6-42. A graph showing anomalies 


Since this data set has no noise or anomalies and is a normal time series dataset, 
there are no anomalies (datapoints in RED) shown and everything is green. 

Next, let’s examine another dataset which is different from the current dataset. You 
will build a LSTM model and see if there are anomalies or not. 


art_daily_jumpsdown 


This data set has mixture of normal data and anomalies. As you can see below, the time 
series has values at different timestamps. 

Using visualization, you can plot the time series now. You convert the timestamp to 
datetime for this work and also drop the timestamp column. As shown below, the time 
series shows the datatime vs. the value column. 


Dataset: art_daily_jumpsdown.csv 


246 


CHAPTER6 LONG SHORT-TERM MEMORY MODELS 


Figure 6-43 shows the code to generate a graph showing the time series. 


Time Series of Value by Date Time 





Date Time 


Figure 6-43. A graph showing the time series 


Let’s add the anomaly column to the original dataframe and prepare a new 
dataframe. Using visualization, you can plot the new time series now. As shown below, 
the time series shows the datatime vs. the value column. Normal data points are shown 
in green and anomalies are shown in red. Figure 6-44 shows the code to generate a graph 


showing anomalies. 


247 


CHAPTER 6 LONG SHORT-TERM MEMORY MODELS 


figure, anes = plt.subplots(figsize=(12, 6)) 

axes. plot(adt[ *‘Datetine’], adf[“observation’], éolor="g2") 

ancmaliesOF = adf.query( ‘anomaly == 1°) 

anes. scatter (anomaliesDF[ ‘Datetime’], anomaliesDF[ observation’), color='r*) 
plt.xlabel( "Date time") 

plt.ylabel( ‘observation' } 

plt.title("Time Series of value by date time") 


Text({e.5, 1.6, "Time Series of value by date time") 


Time Senet of Value by Date Tome 








M4401 NiO} MoO | OT | MOND) MO] OLS 
Duste Tite 


Figure 6-44, A graph showing anomalies 


Since this data set has some noise or anomalies, there are anomalies (datapoints in 
RED) shown and everything else that is normal is green. 

Next, let’s examine another dataset which is different from the current dataset. You 
will build a LSTM model and see if there are anomalies or not. 


art_daily_perfect_square_wave 


This data set has no noise or anomalies and is a normal time series dataset. As you can 
see below, the time series has values at different timestamps. 

Using visualization, you can plot the time series now. You convert the timestamp to 
datetime for this work and also drop the timestamp column. As shown below, the time 
series shows the datatime vs. the value column. 


Dataset: art_daily_perfect_square_wave.csv 


248 


CHAPTER 6 LONG SHORT-TERM MEMORY MODELS 


Figure 6-45 shows the code to generate a graph showing the time series. 


Time Series of Value by Date Time 
10 


08 


06 


; “—~ Value 


o4 
02 


00 
01 Q2 03 Oo oS 6 07 if #2) 10 ll 12 3 14 
Apr 
14 
Date Time 


Figure 6-45. A graph showing the time series 


Let’s add the anomaly column to the original dataframe and prepare a new 
dataframe. Using visualization, you can plot the new time series now. As shown below, 
the time series shows the datatime vs. value column. Since there are no anomalies, 


everything is green. Figure 6-46 shows the code to generate a graph showing anomalies. 


figure, axes = plt.subplots(figsize-(12, 6)) 

axes .plot(adf[‘Datetime'], adf['observation’], color="g") 

anomaliesDF = adf.query(‘anomaly == 1°) 

axes. scatter(anomaliesDF[‘Oatetime’), anomaliesDF[‘observation'), color="r*") 
plt.xlabel( "Cate time’) 

plt.ylabel (‘observation’) 

plt.title(‘Time Series of value by date time") 


Text(a.5, 1.6, "Time Series of value by date time’) 


Time Series of Value by Date Time 





24-01 2014-64-03 2014-04-05 24-07 2014-04-09 14-11 M13 2014-04-15 
Dat 


= Tier 


Figure 6-46. A graph showing anomalies 


249 


CHAPTER 6 LONG SHORT-TERM MEMORY MODELS 


Since this data set has no noise or anomalies and is a normal time series dataset, 
there are no anomalies (datapoints in RED) shown and everything is green. 

Next, let’s examine another dataset which is different from the current dataset. You 
will build a LSTM model and see if there are anomalies or not. 


art_load_balancer_spikes 


This data set has mixture of normal data and anomalies. As you can see below, the time 
series has values at different timestamps. 

Using visualization, you can plot the time series now. You convert the timestamp to 
datetime for this work and also drop the timestamp column. As shown below, the time 
series shows the datatime vs. the value column. 

Dataset: art_load_balancer_spikes.csv 

Figure 6-47 shows the code to generate a graph showing the time series. 


Time Series of Value by Date Time 
10 — Value | 


08 


6 


‘aear 


o4 





O2 





| | : : Tate | | 
ss Si iy tem ag nn Sica er ee get 
Ol nr, 3 Oe is] 0% OF Lt | o 10 Li li 13 14 
Apr 
14 
Date Time 


Figure 6-47. A graph showing the time series 


Let’s add the anomaly column to the original dataframe and prepare a new 
dataframe. Using visualization, you can plot the new time series now. As shown below, 
the time series shows the datatime vs. the value column. Normal data points are shown 
in green and anomalies are shown in red. Figure 6-48 shows the code to generate a graph 


showing anomalies. 


250 


CHAPTER 6 LONG SHORT-TERM MEMORY MODELS 


figure, axes = plt.subplots(figsize=(12, 6)) 

axes. plot(adf["Datetime’], adf[‘observation'), color="g") 

anomaliesOF = adf.query(‘anomaly == 1") 

axes. scatter(anomaliesOF[‘Datetime’], anomaliesOF[‘observation’], color='r") 
plt.xlabel("Date time") 

plt.ylabel( ‘observation') 

plt.title( "Time Series of value by date time’) 


Text(8.5, 1.6, "Time Series of value by date time’) 


Time Senes of Value by Date Time 


oa I : 
* | 
tl 
| it | 
| , | 
02 
f 1! 








Observabon 





| 
00 















































2014-04-01 01440} OOF MOF OL OKO OL. 
Date Tene 
Figure 6-48. A graph showing anomalies 


Since this data set has some noise or anomalies, there are anomalies (datapoints in 
RED) shown and everything else that is normal is green. 

Next, let’s examine another dataset which is different from the current dataset. You 
will build a LSTM model and see if there are anomalies or not. 


ambient_temperature_system_failure 


This data set has mixture of normal data and anomalies. As you can see below, the time 
series has values at different timestamps. 

Using visualization, you can plot the time series now. You convert the timestamp to 
datetime for this work and also drop the timestamp column. As shown below, the time 
series shows the datatime vs. the value column. 

Dataset: ambient_temperature_system_failure.csv 


251 


CHAPTER 6 LONG SHORT-TERM MEMORY MODELS 


Figure 6-49 shows the code to generate a graph showing the time series. 


Time Series of Value by Date Time 





ot * 


ov 


ha 6 * ‘1 " > 
~~ os Pd ~~ ‘* Pal ~Y ie ~ ~ ~~ Pe 


Date Tine 


Figure 6-49. A graph showing the time series 


Let’s add the anomaly column to the original dataframe and prepare a new 
dataframe. Using visualization, you can plot the new time series now. As shown below, 
the time series shows the datatime vs. the value column. Normal data points are shown 


in green and anomalies are shown in red. Figure 6-50 shows the code to generate a graph 
showing anomalies. 


figure, axes = plt.subplots(figsize=(12, 6)) 

axes. plot(adf[‘Datetime'], adf[‘observation’], color="g*) 

anomaliesOF = adf.query( ‘anomaly == 1°) 

axes. scatter(ancmaliesOF[‘Datetime’], anomaliesOF[‘observation'), color='r") 
plt.xlabel( ‘Date time") 

plt.ylabel( ‘observation’ ) 

plt.title( "Time Series of value by date time’) 


Text(@.5, 1.6, "Time Series of value by date time’) 


Time Series of Value by Date Time 





1-07 21348 TS 1G 61-11 R12 6 61? 14) OM S14 
Date Tare 


Figure 6-50. A graph showing anomalies 


252 


CHAPTER 6 LONG SHORT-TERM MEMORY MODELS 


Since this data set has some noise or anomalies, there are anomalies (datapoints in 
RED) shown and everything else that is normal is green. 

Next, let’s examine another dataset which is different from the current dataset. You 
will build a LSTM model and see if there are anomalies or not. 


ec2_cpu_utilization 


This data set has mixture of normal data and anomalies. As you can see below, the time 
series has values at different timestamps. 

Using visualization, you can plot the time series now. You convert the timestamp to 
datetime for this work and also drop the timestamp column. As shown below, the time 
series shows the datatime vs. the value column. 

Dataset: ec2_cpu_utilization.csv 


Figure 6-51 shows the code to generate a graph showing the time series. 


Time Senes of Value by Date Time 


Lo — Value | 
08 


06 


‘Walue 


o4 
02 


00 


15 16 7 18 19 n 21 n Pr M % % 7 28 
Feb 
2014 
Dite Tire 


Figure 6-51. A graph showing the time series 


Let’s add the anomaly column to the original dataframe and prepare a new 
dataframe. Using visualization, you can plot the new time series now. As shown below, 
the time series shows the datatime vs. the value column. Normal data points are shown 
in green and anomalies are shown in red. Figure 6-52 shows the code to generate a graph 
showing anomalies. 


253 


CHAPTER 6 LONG SHORT-TERM MEMORY MODELS 


figure, axes = plt.subplots(figsize=(12, 6)) 

axes.plot(adf[‘Gatetime'], adf['observation'], color='g") 

anomaliesDF = adf.query( ‘anomaly == 1°) 

axes. scatter(ancmaliesoF[ “Datetime’], anomaliesDF[ observation’), color="r*) 
plt.xlabel("Date time") 

plt.ylabel( observation’ ) 

plt.title( "Time Series of value by date time’) 


Text(@.5, 1.0, "Time Series of value by date time’) 


Time Senes of Value by Date Time 


O44 


O24 





004 


. [ih di a dT daniels Jal 
o14-02-15 14-02-17 M14-O2-19 2014-02-21 14-02-23 14-02-25: ML4-O2-27 14-03-01 
Duste ‘Tiere 


Figure 6-52. A graph showing anomalies 


Since this data set has some noise or anomalies, there are anomalies (datapoints in 
RED) shown and everything else that is normal is green. 

Next, let’s examine another dataset which is different from the current dataset. You 
will build a LSTM model and see if there are anomalies or not. 


rds_cpu_utilization 


This data set has mixture of normal data and anomalies. As you can see below, the time 
series has values at different timestamps. 

Using visualization, you can plot the time series now. You convert the timestamp to 
datetime for this work and also drop the timestamp column. As shown below, the time 
series shows the datatime vs. the value column. 


Dataset: rds_cpu_utilization.csv 


254 


CHAPTER 6 LONG SHORT-TERM MEMORY MODELS 


Figure 6-53 shows the code to generate a graph showing the time series. 


Time Series of Value by Date Time 





Figure 6-53. A graph showing the time series 


Let’s add the anomaly column to the original dataframe and prepare a new 
dataframe. Using visualization, you can plot the new time series now. As shown below, 
the time series shows the datatime vs. the value column. Normal data points are shown 
in green and anomalies are shown in red. Figure 6-54 shows the code to generate a graph 
showing anomalies. 


figure, axes = plt.subplots(figsize-(12, 6)) 

axes.plot(adf["Datetime'], adf['observation'], color="g') 

anomaliesOF = adf.query( "anomaly == 1°) 

axes. scatter(anomaliesOF[ "Datetime'], anomaliesOF[ ‘observation'], color='r') 
plt.xlabel( ‘Date time’) 

plt.ylabel( ‘observation’ ) 

plt.title('Time Series of value by date time’) 


Text(@.5, 1.0, ‘Time Series of value by date time’) 


Time Series of Value by Date Time 


06 


§ | 

i. 
02 
00 














7 a - ————— as = ———— ——y - a = ae . end 
2014-02-15 2014-02-17 2014-02-19 2014-02-21 2014-02-23 2014-62-25 2014-02-27 2014-03-01 
Date Time 


Figure 6-54, A graph showing anomalies 


255 


CHAPTER 6 LONG SHORT-TERM MEMORY MODELS 


Since this data set has some noise or anomalies, there are anomalies (datapoints in 


RED) shown and everything else that is normal is green. 


Summary 


In this chapter, we discussed recurrent neural networks and long short-term memory 
models. We also looked at LSTMs as a means to detect anomalies. We also walked 
through several different examples of time series data with different anomalies and 
showed how to start detecting anomalies. 

In the next chapter, we will look at another method of anomaly detection, 


the temporal convolutional network. 


256 


CHAPTER 7 


Temporal Convolutional 
Networks 


In this chapter, you will learn about temporal convolutional networks (TCNs). You will 
also learn how TCNs work and how they can be used to detect anomalies and how you 
can implement anomaly detection using a TCN. 


In a nutshell, the following topics will be covered throughout this chapter: 
e Whatis a temporal convolutional network? 
e Dilated temporal convolutional networks 
e Encoder-decoder temporal convolutional networks 


e TCN applications 


What Is a Temporal Convolutional Network? 


Temporal convolutional networks refer to a family of architectures that incorporate 
one-dimensional convolutional layers. More specifically, these convolutions are causal, 
meaning no information from the future is leaked into the past. In other words, the 
model only processes information going forward in time. One of the problems with 
recurrent neural networks in the context of language translation is that it reads sentences 
from left to right in time, leading it to mistranslate in some cases where the order of the 
sentence is switched around to create emphasis. To solve this, bi-directional encoders 
were used, but this meant future information would be considered in the present. 
Temporal convolutional networks don’t have this problem because they don’t rely on 
information from previous time steps, unlike recurrent neural networks, thanks to their 
casuality. Additionally, TCNs can map an input sequence of any length to an output 
sequence with the same length, just as a recurrent neural network (RNN) can do. 


20 
© Sridhar Alla, Suman Kalyan Adari 2019 


S. Alla and S. K. Adari, Beginning Anomaly Detection Using Python-Based Deep Learning, 
https://doi.org/10.1007/978-1-4842-5177-5_7 


CHAPTER 7 TEMPORAL CONVOLUTIONAL NETWORKS 


Basically, temporal convolutional networks seem to be a great alternative to RNNs. 


These are the advantages of TCNs, specifically considering RNNs in general: 


258 


Parallel computations: Convolutional networks pair well with 

GPU training, particularly because the matrix-heavy calculations 

of the convolutional layers are well suited to the structure of GPUs, 
which are configured to carry out matrix calculations that are part of 
graphics processing. Because of this, TCNs can train much faster than 
RNNs. 


Flexibility: TCNs can change input size, filter size, increase dilation 
factors, stack more layers, etc. in order to easily be applied to various 


domains. 


Consistent gradients: Because TCNs are comprised of convolutional 
layers, they backpropagate differently than RNNs do, and thus all 

of the gradients are saved. RNNs have a problem called exploding 

or vanishing gradients, where sometimes the calculated gradient is 
either extremely large or extremely small, leading to the readjusted 
weight to be too extreme of a change or to be a relatively nonexistent 
change. To combat this, types of RNNs such as the LSTM, GRU, and 
HF-RNN, were developed. 


Lighter on memory: LSTMs store information in their cell gates so 
if the input sequence is long, much more memory is used by the 
LSTM network. Comparatively, TCNs are relatively straightforward 
because they are comprised of several layers that all share their own 
respective filters. Compared to LSTMs, TCNs are much lighter to run 


in regards to their memory usage. 


However, TCNs do carry some disadvantages: 


Memory usage during evaluation mode: RNNs only need to 

know some input xt to generate a prediction, since they maintain 

a summary of everything they learned through their hidden state 
vectors. In comparison, TCNs need the entire sequence up until the 
current point again to make an evaluation, leading to potentially 


higher memory usage than an RNN. 


CHAPTER 7 TEMPORAL CONVOLUTIONAL NETWORKS 


Problems with transfer learning: First, let’s define what transfer 
learning is. Transfer learning is when a model has been trained for 
one particular task (classifying vehicles for example), and has the last 
layer(s) taken out and retrained completely so that the model can be 


used for a new classification task (classifying animals, for example). 


In computer vision, there are some really powerful models, such 
as the inception-v3 model, that have been trained on powerful 
GPUs for quite some time in order to achieve the performances 
that they do. Instead of training our own CNN from the ground up 
(and most of us don’t have the GPU hardware or the time to spend 
in long training an extremely deep model like inception-v3), we 
can simply take inception-v3, for example, which is really good 

at extracting features out of images, and train it to associate the 
features that it extracts with a completely new set of classes. This 
process takes a lot less time since the weights in the entire network 
are already well optimized, so you're only concerned with finding 
the optimal weights for the layers you are retraining. 


That’s why transfer learning is such a valuable process; it allows us 
to take a pretrained, high-performance model and simply retrain 
the last layer(s) with our hardware and teach the model a new 
classification task (for CNNs). 


Going back to TCNs, the model might be required to remember 
varying levels of sequence history in order to make predictions. 

If the model did not have to take in as much history in the old task 
to make predictions, but in the new task it had to receive even 
more/less history to make predictions, that would cause issues 
and might lead the model to perform poorly. 


In a one-dimensional convolutional layer, we still have parameter k to determine the 


size of our kernel, or filter. The way the convolutional layer works is pretty similar to the 


two-dimensional convolutional layer you looked at in Chapter 3, but we are only dealing 


with vectors in this case. 


Here’s an example of what the one-dimensional convolutional operation looks like. 


Assuming an input vector defined as in Figure 7-1, 


259 


CHAPTER 7 TEMPORAL CONVOLUTIONAL NETWORKS 


x=] 10 5 15 20 10 20 | 


Figure 7-1. A vector x defined with these corresponding values. This is the input 
vector 


and a filter initialized as in Figure 7-2, 
Filter Weights 


4 0.2 0.1 


Figure 7-2. The filter weights associated with this one-dimensional 
convolutional layer 


the output of the convolutional layer is calculated as shown in Figure 7-3, Figure 7-4, 
Figure 7-5, and Figure 7-6. 


Input: 10 5 15 20 10 20 
* —s 
iter Weights: | 1 0.2 0.1 | 
4 4 
T 10+1+1.5 
| 


Output: 12.5 


Figure 7-3. How the first entry of the output vector is calculated using the filter 
weights. The filter weights are multiplied element-wise with the first three entries in 
the input, and the results are summed up to produce the output value 


260 


CHAPTER 7 TEMPORAL CONVOLUTIONAL NETWORKS 


Input: 10 5 15 20 10 20 
* Lo a 
Filter Weights: E 0.2 0.1 | 
1 5+3+2 
| 
Output: 125 10 | 


Figure 7-4. How the second entry of the output vector is calculated using the 
filter weights. The procedure is the same as in Figure 7-3, but the filter weights are 
shifted right one 


Input: “10 5 15 20 10 20 


a f F ia} 
i oa 
I 15+4+1 
| 


Output: 125 10 20 | 


Figure 7-5. How the third entry of the output vector is calculated using the filter 
weights 


Input: 10 5 15 20 10 20 


* , 44 
Filter Weights: | 10.2 0.1 | 
La 4 
1 20+2+2 
! 


Output: 125 10 20 24 


Figure 7-6. How the last entry of the output vector is calculated using the filter 
weights 


261 


CHAPTER 7 TEMPORAL CONVOLUTIONAL NETWORKS 


Now we have the output of the one-dimensional convolutional layer. These 
one-dimensional convolutional layers are quite similar to how two-dimensional 
convolutional layers work, and they comprise nearly the entirety of the two different 
TCNs we will look at: the dilated temporal convolutional network and the encoder- 
decoder based temporal convolutional network. It is important to note that both 
models involve supervised anomaly detection, although the encoder-decoder TCN is 
capable of semi-supervised anomaly detection since it is an autoencoder. 


Dilated Temporal Convolutional Network 


In this type of TCN, we deal with a new property known as a dilation. Basically, when the 
dilation factor is greater than 1, we introduce gaps in the output data that correspond to 
the dilation factor. To understand the concept of dilation better, let’s look at how it works 
for a two-dimensional convolutional layer. 

This is a standard convolution, equivalent to what you looked at in Chapter 3. 
You can also think of a standard convolutional layer as having a dilation factor of one 


(refer to Figure 7-7). 


Feature Map 











Figure 7-7. A standard convolution with a dilation factor of one 


Now, let’s look at what happens when we increase the dilation factor to two. For the 


first entry in the feature map, the convolution looks like Figure 7-8. 


262 


CHAPTER 7 TEMPORAL CONVOLUTIONAL NETWORKS 


Feature Map 





Figure 7-8. A standard convolution with a dilation factor of two defining the first 
entry in the feature map 


Notice that the spacing between each sampled entry has increased by one across all 
directions. Vertically, horizontally, and diagonally, the sampled entries are all spaced 
apart by one entry. Essentially, this spacing is determined by finding what d - 1 is, where 
d is the dilation factor. For a dilation factor of three, this spacing will be two apart. Now, 


for the second entry, the convolution process proceeds as normal (see Figure 7-9). 


Feature Map 





Figure 7-9. The convolution with a dilation factor of two defining the second 
entry in the feature map 


263 


CHAPTER 7 TEMPORAL CONVOLUTIONAL NETWORKS 


Once the process terminates, we will have our feature map. Notice the reduction 
in dimensionality of the feature map, which is a direct result of increasing the dilation 
factor. In the standard two-dimensional convolutional layer, we had a 4x4 feature map 
since the dilation factor was one, but now we have a 3x3 feature map after increasing this 


factor to two. 

A one-dimensional dilated convolution is similar. Let’s revisit the one-dimensional 
convolution example and modify it a bit to illustrate this concept. 

Assume now that the new input vector and filter weights are as shown in Figure 7-10 


and Figure 7-11. 


K=/28124642 12) 


Figure 7-10. The new input vector weights 
and 


Filter Weights 


0.5 0.2 0.4 


Figure 7-11. The new filter weights 


Let’s also assume now that the dilation factor is two, not one. The new output vector 
is the following, using dilated one-dimensional convolutions with a dilation factor of two 


(see Figure 7-12, Figure 7-13, Figure 7-14, and Figure 7-15). 


Spacing of 1 


[— 
Input: 281246 4 2 12 | 


* hL Lv 


Filter Weights:| 0.5 0.2 0.4] 


14 4 


II 1+2.4+2.4 


Output: 5.8 | 


Figure 7-12. Calculating the first entry in the output factor using dilated 
one-dimensional convolutions with a dilation factor of two 


264 


CHAPTER 7 TEMPORAL CONVOLUTIONAL NETWORKS 


Input: 2 8 12 46 4 2 12 


: { | 


Filter Weights: 05 0.2 0.4 


144 


T 4+0.8+1.6 
Output: 528 6.4 | 


Figure 7-13. The next set of three input vector values are multiplied with the filter 
weights to produce the next output vector value 


Input: 2 8 124642 12 


, | 


Filter Weights: 0.5 0.2 0.4] 


1 44 


I 6+1.2+0.8 


Output: 58 6.4 8 | 


Figure 7-14. The third set of three input vector values are multiplied with the filter 
weights to produce the next output vector value 


Input: 2 8 12 46 4 2 12 


: , | ¢ 


Filter Weights: 0.5 0.2 0.4 


1 + 4 


T 2+0.8+ 4.8 


Output: 58 6.4 8 7.6 | 


Figure 7-15. The final set of three input vector values are multiplied with the filter 
weights to produce the last output vector value 


265 


CHAPTER 7 TEMPORAL CONVOLUTIONAL NETWORKS 


Now that we’ve covered what a dilated convolution looks like in the context of one- 
dimensional convolutions, let’s look at the difference between an acausal and a casual 
dilated convolution. To illustrate this concept, assume that both examples are referring 
to a set of dilated one-dimensional convolutional layers. With that in mind, Figure 7-16 
shows an acausal network. 


Output Layer 


| isa 
wader? | 1 TT | | a EE 
| NN 


Hidden Layer 1 








[Hane HD 


Figure 7-16. An acausal dilated network. The first hidden layer has a dilation 
factor of two, and the second hidden layer has a dilation factor of four. Notice how 
inputs “forward in the sequence” contribute to the next layer’s node as well 


Input Layer 





It might not be that apparent from the way the architecture is structured, but if you 
think of the input layer as a sequence of some data going forward in time, you might be 
able to see that information from the future would be accounted for when selecting the 
output. In a casual network, we only want information that we've learned up until the 
present, so none of the information from the future will be accounted for in the model’s 
predictions. Figure 7-17 shows what a causal network looks like. 


266 


CHAPTER 7 TEMPORAL CONVOLUTIONAL NETWORKS 


omni TT EE EET EL EL 
| 

ne ne a 
a ae ee 


Hidden Layer 1 





Bann fs 


Figure 7-17. A causal dilated network. The first hidden layer has a dilation factor 
of two, and the second hidden layer has a dilation factor of four. Notice how no 
inputs forward in the sequence contribute to the next layer’s node. This type of 
structure is ideal if the goal is to preserve some sort of flow within the data set, 
which is time in our case 


Input Layer 





From this, we can see how the linear nature of time is preserved in the model, and 
how no information from the future would be learned by the model. In casual networks, 
only information from the past until the present is considered by the model. The dilated 
temporal convolutional network we are referring to has a similar model architecture, 


utilizing dilated causal convolutions in each layer preceding the output layer. 


Anomaly Detection with the Dilated TON 


Now that you know more about what a TCN is and how it works, let’s try applying a 
dilated TCN to the credit card dataset. 


First, import all of the necessary packages (see Figure 7-18a). 


267 


CHAPTER 7 TEMPORAL CONVOLUTIONAL NETWORKS 


import keras 

from keras import regularizers, optimizers 

from keras import losses 

from keras.models import Sequential, Model, load_model 

from keras.layers import Dense, Input, Dropout, Embedding, LSTM 
from keras.optimizers import RMSprop, Adam, Nadam 

from keras.preprocessing import sequence 


from keras.layers import Conv1D, Flatten, Activation, SpatialDropout1D 
from keras.callbacks import ModelCheckpoint, TensorBoard 
from keras.utils import to_categorical 


import sklearn 
from sklearn.preprocessing import StandardScaler, MinMaxScaler 


from sklearn.model_selection import train_test_split 
from sklearn.metrics import confusion_matrix, roc_auc_score 
from sklearn.metrics import classification_report 


import seaborn as sns 
import pandas as pd 
import numpy as np 
import matplotlib 


import matplotlib.pyplot as plt 
import matplotlib.gridspec as gridspec 
Smatplotlib inline 


import tensorflow 
import sys 
print("Python: ", sys.version) 


print("pandas: ", pd.__version_) 
print("numpy: “, np.__version_) 
print("seaborn: “, sns.__version_) 
print("matplotlib: ", matplotlib._version_) 
print("sklearn: “, sklearn.__version_) 
print("Keras: “, keras.__version_) 
print("Tensorflow: ", tensorflow.__version_) 


Python: 3.7.3 (default, Apr 24 2019, 15:29:51) [MSC v.1915 64 bit (AMD64)] 
pandas: 9@.24.2 

numpy: 1.16.4 

seaborn: 9.9.0 

matplotlib: 
sklearn: 9.2 
Keras: 2.2.4 
Tensorflow: 1.13.1 


3.1.0 
1.2 


Figure 7-18a. Importing all of the necessary packages in order to start your code 


Then, you must create a class for the visualization of confusion matrix, etc. 
(see Figure 7-18b). 


268 


CHAPTER 7 TEMPORAL CONVOLUTIONAL NETWORKS 


class Visualization: 
labels = ["“Normal", “Anomaly”] 


def draw_confusion_matrix(self, y, ypred): 
matrix = confusion_matrix(y, ypred) 


plt.figure(figsize=(10, 8)) 

colors=[ “orange”, "green" ] 

sns.heatmap(matrix, xticklabels=self.labels, yticklabels=self.labels, cmap=colors, annot=True, 
plt.title("Confusion Matrix") 

plt.ylabel( ‘Actual’ ) 

plt.xlabel('Predicted’) 

plt.show() 


def draw_anomaly(self, y, error, threshold): 
groupsDF = pd.DataFrame({‘error’: error, 
‘true’: y}).groupby( ‘true’ ) 


figure, axes = plt.subplots(figsize=(12, 8)) 


for name, group in groupsDF: 
axes. plot(group.index, group.error, marker="x' if name == 1 else ‘o', linestyle="", 
color='r" if name == 1 else ‘'g’, label="Anomaly” if name == 1 else “Normal") 


axes. hlines(threshold, axes.get_xlim()[@], axes.get_xlim()[1], colors="b", zorder=100, label=" 
axes. legend() 


plt.title("Anomalies”) 
plt.ylabel("Error™) 
plt.xlabel("Data") 
plt.show() 


def draw_error(self, error, threshold): 
plt.plot(error, marker='0', ms=3.5, linestyle='', 
label='Point') 


plt.hlines(threshold, xmin=@, xmax=len(error)-1, colors="b", zorder=100, label="Threshold’ 
plt.legend() 

plt.title("Reconstruction error”) 

plt.ylabel("Error”) 

plt.xlabel("Data") 

plt.show() 


Figure 7-18b. Creating a visualization class 


After that, proceed to importing your data set and processing it (see Figure 7-19). 


Gr = pd.read Gsy ("datasets /credi tcardiraud/créeditcard.cav", 
sep=",", index col=None) 


print (df.shape) 


df.head() 





Figure 7-19. Importing your data set and displaying the first five entries 


269 


CHAPTER 7 TEMPORAL CONVOLUTIONAL NETWORKS 


The output should look somewhat like Figure 7-20. 


In [171]: 


Out (171): 


df = pd.read csv("datasets/creditcardfraud/creditcard.csv", sep=",", index _col=None) 
print (df.shape) 
df.head() 


(284607, 31) 


Time 


0 0.0 


a 


0.0 
1.0 
1.0 
2.0 


> ww N 


< 


v1 
-1.359807 
1.191857 
-1.358354 
-0.966272 
-1.158233 


5 rows x 31 columns 


V2 
-0.072781 
0.266151 
-1.340163 
-0.185226 
0.877737 


V3 
2.536347 
0.166480 
1.773209 
1.792993 
1.548718 


v4 
1.378155 
0.448154 
0.379780 
-0.863291 
0.403034 


V5 
-0.338321 
0.060018 
-0.503198 
-0.010309 
-0.407193 


V6 
0.462388 
-0.082361 
1.800499 
1.247203 
0.095921 


Figure 7-20. The first five entries of the data frame 


The data frame continues in Figure 7-21. 


-0.018307 

- 0.225775 
0.247996 

. 0.108300 

. 70.009431 


Figure 7-21. 


V21 


0.277838 


0.005274 
0.798278 


V22 


-0.638672 
0.771679 


V23 


0.101288 
0.909412 
-0.190321 


V24 
-0.110474 0.066928 
-0.339846 
-0.669281 
-1.175575 
“0.137458 0.141267 


V25 
0.128539 
0.167170 

-0.327642 
0.647376 
“0.206010 


0.125895 


V26 
-0.189115 0.133558 


-0.139097 -0.055353 
“0.221929 0.062723 
0.502292 0.219422 


The output in Figure 7-20 scrolled right 


-0.008983 


v7 
0.239599 
-0.078803 
0.791461 
0.237609 
0.592941 


V27 


-0.021053 


-0.059752 
0.061458 


vs 
0.098698 
0.085102 
0.247676 
0.377436 
-0.270533 


V28 


0.014724 


0.215153 


Amount 


v9 


0.363787... 


-0.255425__... 


-1.514654 ... 


-1.387024 ... 
0.817739... 


149.62 


2.69 


373.66 
123.50 
69.99 


Class 


o o6o06UCUD™lC GD 


-0.¢ 
0.2 
0.2 


-0.C 


Each entry is noticeably large, with 31 columns per entry. If you check the tail end of 


the data frame in Figure 7-22, 


270 


284802 
284803 
284804 
284805 
284806 


df. 


tail() 


Time 
172786.0 
172787.0 
172788.0 
172788.0 
172792.0 


v1 


-11.881118 


-0.732789 

1.919565 
~0.240440 
-0.533413 


V2 
10.071785 
-0.055080 
-0.301254 

0.530483 
-0.189733 


V3 


-9.834783 


2.035030 


-3.249640 


0.702510 
0.703337 


CHAPTER 7 TEMPORAL CONVOLUTIONAL NETWORKS 


v4 
-2.066656 
-0.738589 
-0.557828 
0.689799 
-0.506271 


V5 


-5.364473 


0.868229 
2.630515 


-0.377961 
-0.012546 


V6 
-2.606837 
1.058415 
3.031260 
0.623708 
-0.649617 


v7 


-4.918215 


0.024330 


-0.296827 
-0.686180 


1.577006 


v8 
7.305334 
0.294869 
0.708417 
0.679145 
-0.414650 


v9 
1.914428 


0.584800 . 
0.432454 ... 
0.392087 ... 
0.486180 ... 


ane 


5S rows x 31 columns 


< 


Figure 7-22. The tail end of the data frame. Notice how large the values for 
time get 


you can see that the data set is pretty massive with 284,807 entries in total (the index 
starts at 0). Additionally, notice how the values for time become absurdly large. If you 
pass in values this large into the model for training, you are bound to get errors with 
convergence. Not only that, it’s just good practice to normalize any large values, since it 
improves performance and training efficiency if you pass in smaller values to the model. 
Run the code in Figure 7-23 to standardize the values for Time and for Amount. 


dft['Amount'] = 
StandardScaler().f1it transform(df['Amount'].values.reshape (-1 


Ly 


df['Time'] 
1, 1)) 


= Standardscaler().f1t transtorm(di[ Time'’|].values.réeshape(- 


af. tail) 





Figure 7-23. This code standardizes the values for Time and Amount 


Now you can see that the values for the columns Time (Figure 7-24) 


271 


CHAPTER 7 TEMPORAL CONVOLUTIONAL NETWORKS 


In [172] df["Amount"]) = StandardScaler().fit_transform(df['Amount'].values.reshape(-1, 1)) 
df["Time'] = StandardScaler(). fit _transform(df ["Time'].values.reshape(-1, 1)) 
df.tail() 
Out. [17 
Time V1 V2 V3 V4 V5 V6 V7 V8 V9 
284802 1.641931 -11.881118 10.071785 -9.834783 -2.066656 -5.364473 -2.606837 -4.918215 7.305334 1.914428 
284803 1.641952 -0.732789 -0.055080 2.035030 -0.7338589 0.866229 1.058415 0.024330 0294869 0.584800 
264804 1.641974 1.919565 -0.30971254 -3.249640 -0.557828 2.630515 3.031260 -0.296827 0.708417 0.432454 
284805 1641974 -0240440 0.530483 0.702510 0.689799 -0.377961 O623708 -0.686180 0679145 0392087 
284806 1.642058 -0.533413 -0.189733 0.703337 -0.506271 -0.012546 -0.649617 1.577006 -0.414650 0.486180 


5 rows x 31 columns 


Figure 7-24. The standardized values for the Time column 


and for Amount (Figure 7-25) 


V21 V22 V23 V24 V25 V26 V27 V28 Amount Class 
0.213454 0.111864 1.014480 -0.509348 1.436807 0.250034 0.943651 0.823731 -0.350151 0 
0.214205 0.924384 0.012463 -1.016226 -0.606624 -0.395255 0.068472 -0.053527 -0.254117 0 
0.232045 0.578229 -0.037501 0.640134 0.265745 -0.087371 0.004455 -0.026561 -0.081839 0 
0.265245 0.800049 -0.163298 0.123205 -0.569159 0546668 0.108821 0.104533 -0.313249 0 
0.261057 0.643078 0.376777 0.008797 -0.473649 -0.818267 -0.002415 0.013649 0.514355 0 


Figure 7-25. The standardized values for the Amount column 


are much smaller and much more manageable numbers to pass in. 

Since there are so many entries in the entire data set, it’s best to limit the number 
of “normal” data entries you feed into the model since the model seems to ignore the 
anomalies if the entire data set is passed in. To avoid drowning out the anomalous data 
entries, let’s pick 10,000 normal entries to derive your training and testing data sets from 
(see Figure 7-26). 


arlLart["Class™ | 1] 


anomalies = 


dE Ldti™"class”| == 0] 


normal. = 


normal.shape 


anomalies.shape, 





Figure 7-26. Defining two data frames: anomalies and normal 


The output should look somewhat like Figure 7-27. 


212 


CHAPTER 7 TEMPORAL CONVOLUTIONAL NETWORKS 


In [5]: anomalies = df[df["Class"] = i] 
normal = df[df["Class"] == 0] 


anomalies.shape, normal.shape 


Out[5]): ((492, 31), (284315, 31)) 


Figure 7-27. The output of the code in Figure 7-26 


In this block of code, you name two new data frames as anomalies and normal, with 
their names corresponding to their content. Checking their shape reveals that there are 
relatively few anomalies compared to the entire data set, comprising around 0.173% of 
the whole data set. 


Now let’s get to defining your training and testing data sets (see Figure 7-28). 


for f in range(0, 20): 


normal = normal.iloc[np.random.permutation(len(normal) ) ] 


data set = pd.concat([normal[:2000], anomalies] ) 


x train, x test. = train test splatidata Set, test size = 0.4, 
random state = 42) 


a Etain. =k Crain.sore vaeluee(by—|* Time” |) 


x test. = x test.sort values (by—[* Time’ ]} 


y_tiain. = x train| "Class |] 


y test =. x. test ("Class™| 


x train-head (10) 





Figure 7-28. Defining the training and testing sets and sorting both by time to 
maintain the temporal flow 


273 


CHAPTER 7 TEMPORAL CONVOLUTIONAL NETWORKS 


Shuffling the normal data set as well as using the train test split function to 
randomly select testing and training samples helps ensure that you pick a good range of 
data values to represent normal data. You can limit the number of iterations in the for 
block at the start of the code if you wish. 

From there, the first 10,000 data entries of the shuffled normal data are concatenated 
with the anomalies, and the training and testing data sets are created. Both sets are then 
sorted by the Time column to maintain the entire aspect of time. 

The output should look somewhat like the Figure 7-29. 


for f in range(®@, 20): 
normal = normal.iloc[np.random. permutation(len(normal))] 


data_set = pd.concat([normal[:200@], anomalies]) 


x_train, x_test = train_test_split(data_set, test_size = 6.4, random_state = 42) 
x_train = x_train.sort_values(by=[ Time" ]) 
x_test = x_test.sort_values(by=[‘"Time']) 
y_train = x_train["Class"] 
y_test = x_test[{"Class”"] 
x train. head(10) 
Time Wi V2 V3 Va V5 V6 V7 Vs vs... V21 
623 -1.05806044 -3.043541 -3.15/307 1.088403 2285044 1.350805 -1.004823 0.325574 -O.00/7704 -O0.270053 ... 0.661606 
890 -1.982432 1.238045 0.240101 6.171456 0.506075 -0.221120 -0.576537 -0.078469 0.010065 -0.083807 ... -<0.267542 
1195 -1.9077062 1.040004 -1.719288 1.556730 -0.080816 -2.156667 0.754853 -1.708567 0.360145 1.174452 ... O.307719 
16822 -1.971524 1.311511 O.388207 -0.084504 0.460422 0.195766 -0.488411 6.153118 -0.208485 -0.074842 ... -0.333871 
1535 -1.071229 1274773 ~0.472615 -0.856227 -2.280194 O.057619 -0.662405 0.286438 -O.270717 1.426729 ... 0.016321 
1872 -1.004659 -0.830004 0.851005 1.404343 -0.710150 -0.355064 -0.751566 0.508639 O.226261 0281235 ... -0.1863800 
2086 -1.962701 1.189404 0.686530 -0.002011 2.206324 0.448080 -0.314318 0.496391 -0.138754 -1.351705 ... 0.025852 
2760 -1.0438171 -0.449001 1.011487 1.756003 -0.148778 0.004508 -0.846753 1.086048 -0.465112 0.175563 ... -0.369136 
3101 -1.940527 O.026270 1.356428 -0.190007 0.680016 0.740422 -0.666152 1.000473 -0.224030 -0.334645 ... 0.052675 
3213 -1.037831 -O.676087 O774112 1.000828 -2205852 0.230830 -0.840156 0.066565 -0.140874 O7e4708 .. 0.120434 


10 rows = 31 columns 


Figure 7-29. The data sets sorted by the Time column 


Notice how the indices vary in number, although they are all ordered by time. 

Now you can move on to reshaping your data sets to pass into the model. 

Running the code block in Figure 7-30 can give you a sense of how the data sets are 
structured. 


274 


CHAPTER 7 TEMPORAL CONVOLUTIONAL NETWORKS 


fe) 


print ("Shapes:\nx train:%s\ny train:%s\n" % (x train.shape, 
y train.shape) ) 


fe) 


print("x test:%s\ny test:%s\n" % (x test.shape, 
y test.snape)) 





Figure 7-30. Outputs the shapes to provide an understanding of how the data sets 
are structured 


The output should look somewhat like Figure 7-31. 


print("Shapes: \nx_train:%s\ny_train:%s\n" % (x_train.shape, y_train.shape)) 
print("x_test:%s\ny_test:%s\n" % (x_test.shape, y_test.shape)) 


Shapes: 

x_train: (1495, 31) 
y_train: (1495, ) 
x_test:(997, 31) 
y_test: (997, ) 


Figure 7-31. The shapes of both data sets 


To pass the data sets into the model, the x sets must be three-dimensional, and the y 
sets must be two-dimensional. You can simply reshape the x sets, and change the y sets 
to be categorical (refer to Chapter 3 to see what the keras to_categorical() function does). 


Run the code in Figure 7-32. 


xX Crain = Np.array (xX_Crain) reshape (x Trains shape [0], 
x train.shape[l], 1) 


x USS = Nowerray (xX: Lest) .«.reshape (=. Tesu.<shapel0], 
m. Leste snepe tit, 1) 


input. shape = (x train.shape[l], 1) 


VY Urain = Keras. ull lesto: Cavegorical(y train, 2) 


Vy vest = keras:ttlils.lo.categorical (y test, 2) 





Figure 7-32. Makes the x sets three-dimensional and the y sets two-dimensional 
by reshaping the x sets and changing the y sets to be categorical. The reshaping of 
the x sets is done to fit the input shape of the model 


2795 


CHAPTER 7 TEMPORAL CONVOLUTIONAL NETWORKS 


Let’s take a look at how the operations changed the data sets. Run the code in 
Figure 7-33. 


fe) 


PLLINC ("Shapes? \nk “trainiss \ny train? os \n" = 4x train. shape; 
¥Y_ Ulainwstepe)) 


print ("x Testscs\ny Cestiss\n" = (x test.shape, y test. shape) ) 


print ("input shape: {}\n".format (input shape) ) 





Figure 7-33. Code to print the shapes of the data sets to see how the operations 
changed the structure 


The output should look like Figure 7-34. 


print("Shapes: \nx_train:%s\ny_train:%s\n" % (x_train.shape, y_train.shape)) 
print("x_test:%s\ny_test:%s\n" % (x_test.shape, y_test.shape)) 
print("input_shape: {}\n". format (input_shape) ) 

Shapes: 

x_train:(1495, 31, 1) 

y_train: (1495, 2) 


x_test:(997, 31, 1) 
y_test:(997, 2) 


input_shape:(31, 1) 


Figure 7-34. The x sets are three-dimensional while the y sets are two-dimensional 


Alright, now both of the data sets have been reshaped successfully. The input shape 
tells the model how many columns and rows to accept per entry. In this case, the input 
shape indicates that there will be 1 row and 31 columns. 

Now let’s move on to defining your model. The code chunk in Figure 7-35 defines the 


one-dimensional convolutional layers and the dropout layers. 


276 


CHAPTER 7 TEMPORAL CONVOLUTIONAL NETWORKS 


input layer = Input (shape=(input_ shape ) ) 


#Series of temporal convolutional layers with dilations increasing by 
powers of 2. 


cony 1 = ConvilD(tilters=128, kernel size=2, Gilation rare=1, 
padding='causal', strides=1,input_shape=input shape, 


kernel regularizer=regularizers,12(0.01), 





activation='relu’) (input layer) 


#Dropout layer after each 1D-convolutional layer 


drop. 1 = SpatialDropoutlD (0.05) (conv 1) 


conv 2 = ConvilD(riltéers=l26, kernel size=2, dilation rate=zZ, 


padding='causal',strides=1, 
kernel regqularizer—regularizers.i2 (0.01); 





activation="relu”) (drop 1) 


drop 2 = SpatialDropoutlD (0.05) (conv. Z) 


cony 3. = ConviD(tilvers=l2s5, kernel size=2, diletion rate=4, 


padding='causal', 
Sstrides=1,;kernel regqularizer=reqularizers.t2 (0.01); 





activeation="relu’) (drop 2) 


Gropp 3 SpatialDropout1D(0.05) (conv_3) 


conv 4 ConviD(tilters=126, kernel size—2, dilation rate=s, 


padding='causal', 
Sstrides—-1, kernel requilarizer—-regulerizers.12 (0205), 





activation="relu") (arop: 3) 


drop 4 = SpatialDropout1D(0.05) (conv_ 4) 





Figure 7-35. Defines all of the one-dimensional convolutional layers and the 
dropout layers in the model 


277 


CHAPTER 7 TEMPORAL CONVOLUTIONAL NETWORKS 


The code chunk in Figure 7-36 defines the last two layers, which consist of a layer to 
flatten the data and one layer to represent the two classes. 


#Flatten layer to feed into the output layer 


flat = Flatten) (arop: 4) 


output layer = Dense(2, activation='softmax') (flat) 


TCN = Model (anputs=input layer, outputs=cutput: layer) 





Figure 7-36. Defines the last two layers, which consist of a layer to flatten the data 
and one layer to represent the two classes 


Now let’s compile the model and look at the summary of the layers (see Figure 7-37). 


TCN.comp1lé(loss=keres. loOsses.cacvegorical crossenlropy, 
optimizer=optimizers.Adam(lr=0.002), 


metrics=[‘mae’, ‘'accuracy']) 


checkpointer — Model Checkpoint (tileparh="model TCN creditcard.ho”, 


verbose=0, 


save best only=Trus) 


TCN.Summary () 





Figure 7-37. Code to compile the data, define a callback to save the model under 
the given filepath, and output the summary of the model 


The output should look like Figure 7-38. 


278 


CHAPTER 7 TEMPORAL CONVOLUTIONAL NETWORKS 


TCN.compile(loss=keras.losses.categorical_crossentropy, 
optimizer=optimizers.Adam(1r=0.002), 
metrics=[‘mae’, ‘accuracy‘]) 


checkpointer = ModelCheckpoint(filepath="model_ TCN_creditcard.hs", 
verbose=@, 
save_best_only=True) 


TCN. summary () 

Layer (type) ~— Output Shape ———(‘ié‘é rem 
input_11 (Inputayer) (None, 31,1) s—(i‘i 
convid_41 (ConviD) (None, 31, 128) 384 
spatial_dropoutid_41 (Spatia (None, 31, 128) @ 
convid_42 (ConviD) ~—~— (None, 31, 128) = © 3289” 
spatial_dropoutid_42 (Spatia (None, 31, 128) @ 
convid_43 (ConviD) = +~— (None, 31, 128) 32896 
spatial _dropoutid_43 (Spatia (None, 31, 128) @ 
convid_44 (ConviD) | ~— (None, 31, 128) (ati8289G—i—=*é 
spatial_dropoutid_44 (Spatia (None, 31, 128) @ 
flatten_11 (Flatten) (None, 3968) @ 

dense_11 (Dense) (None, 2) 7938 


Total params: 107,010 
Trainable params: 107,010 
Non-trainable params: @ 


Figure 7-38. The summary of the model. You can use this to help debug your 
models when you're creating one from scratch by checking that the output shapes 
for the layers match the input shapes of the subsequent layer 


Looking at the model summary can help you understand more about what’s 
going on at each layer. Sometimes, it can help with debugging, where there can be 
dimensionality reductions that you don’t expect. For example, sometimes when odd 
dimensions become reduced by a factor of 2, they might become rounded down. When 
expanding back up, this can prove to be problematic because the new dimension does 
not match the old dimension. You can expect to run into problems like these with 
autoencoders, where the entire aim of the architecture is to compress the data and 
attempt to reconstruct it. 


219 


CHAPTER 7 TEMPORAL CONVOLUTIONAL NETWORKS 


Run the code in Figure 7-39 to begin the training process. 


TCN.f1C(x train, y train, 
batch size=128, 
epochs=25, 


verbose=l, 


validation data=(x test, 


callbacks 


[checkpointer] ) 


y test), 


Figure 7-39. Code to start the training process for the model 


You should see something like Figure 7-40 during the training process. 


In [169]: 


Figure 7-40. The output during the training process 


TCN.fit(x_ train, y train, 
batch _size=126, 
epochs=25, 
verbose=l, 


validation data=(x test, y test), 


callbacks = [checkpointer])) 


Train on 6295 samples, validate on 4197 samples 


Epoch 1/25 


Epoch 2/25 

6295/6295 [SSH SS SSH S SHH S SSS SSS SSS SS SS====] 
Epoch 3/25 
6295/6295 
Epoch 4/25 


[ === SSS SS ST TS LS SST SS TST SS STE ETS TS TS TST 


Epoch 5/25 
6295/6295 [==============================] 


Epoch 6/25 


4s - loss: 
Os - loss: 
Os - loss: 
Os - loss: 
Os - loss: 
Os - loss: 


3.1321 


0.1426 


0.0857 


0.0716 


0.0722 


0.0669 


oO 


- 9633 


Oo 


9889 


: 0.9900 


Oo 


- 9897 


: 0.9906 


At the end, you should see something like Figure 7-41. 


Epoch 21/25 


Epoch 22/25 


Epoch 23/25 


Epoch 24/25 


Epoch 25/25 


6295/6295 (ssssssssaasasssssssssss=======] - 0s 
6295/6295 [sssssssasssssssssssssssssssass=] - Os 
6295/6295 [=======S=SSSSSSSSSSSS=========) - Os 
6295/6295 [==========s=ssssssssssssss=s===] - Os 
6295/6295 [==========S=S==SSSS===========] - Os 


loss: 0. 


loss: 0. 


loss: 0. 


0566 - 


0575 - 


0S71 - 


loss: 9.05 


loss: 0. 


0.9916 


0.9914 


0.9909 


0.9914 


0.9913 


Figure 7-41. The output when the training process ends 


val_ loss: 
val _ loss: 
val_loss: 
val _ loss: 
val_ loss: 


val loss: 


val _ loss: 
val _ loss: 
val loss: 
val _ loss: 


val_ loss: 


-0809 


.0748 


0728 


0826 


0651 


0652 


.0701 


v 
val acc: 
val_acc: 
val _ ace: 


“r 
¥ 





o 


al acc: 


Qo 


oOo 


Qo 


oO 


al acc: 


oO 


_ace: 


val _ acc: 
val_acc: 
val acc: 
val_acc: 


val _ acc: 


Now that the training is finished, you can evaluate your model’s performance 
(see Figure 7-42). 


280 


- 9881 


9888 


-9909 


.9909 


- 9909 


- 9852 


° 
0 
@ 
wo 
°o 


oOo 
© 
© 
—J 
@ 


oO 
‘oO 
‘© 
oO 
—< 


CHAPTER 7 TEMPORAL CONVOLUTIONAL NETWORKS 


score = TCN.@valuate(x test, y.cest, 
verbose=0) 


print('Test loss:', score[Q]) 


print ('Test mae:', score[1]) 


print ('Test accuracy:', score[2]) 





Figure 7-42. Code to evaluate the loss and the accuracy on the test sets 


The output should look somewhat like Figure 7-43. 


score = TCN.evaluate(x_test, y_test, verbose=1) 
print('Test loss:", score[@]) 

print( "Test mae:", score[i]) 

print( "Test accuracy:'’, score[2]})| 


997/997 [s=sssssesssseseesssssssssssz=z] - 85 101us/step 
Test loss: @.17798992889814655 

Test mae: @.067/75481965109243 

Test accuracy: @.9648946840521565 


Figure 7-43. The generated loss and accuracy scores for the test set. The accuracy 
is really good, but again, accuracy isn’t always the best metric to judge models by 


Now you can check the AUC score (see Figure 7-44). 


from sklearnm.méetrics amport roc. auc score 
preds = TCN.predicte (x test) 


y_ pred = npwround (preds) 


auc = DOG ave Score y pred, vy Test) 


print ("AUC: {:.2c}".format (auc)) 





Figure 7-44. Code to generate an AUC score given the test sets and the 
predictions 


The output should look somewhat like Figure 7-45a. 


281 


CHAPTER 7 TEMPORAL CONVOLUTIONAL NETWORKS 


from sklearn.metrics import roc_auc_score 
preds = TCN.predict(x_test) 

y_pred = np.round(preds) 

auc = roc_auc_score( y_pred, y_test) 
print("AUC: {:.2%}".format (auc) ) 


AUC: 97.41% 


Figure 7-45a. The generated AUC score of 99.02% for this model 


For the classification report and confusion matrix, see Figure 7-45b. 


print(classification_report(y_test, y_pred)) 


precision recall fl-score support 


Q 0.96 1.02 8.98 799 

1 8.99 8.83 8.90 198 

micro avg 8.96 8.96 2.96 997 
macro avg 0.97 8.92 8.94 997 
weighted avg 0.97 @.96 @.96 997 
samples avg 0.96 0.96 @.96 997 


viz = Visualization() 

y_pred2 = np.argmax(y_pred, axis=1) 
y_test2 = np.argmax(y_test, axis=1) 
viz.draw_confusion_matrix(y_test2, y_pred2) 


Confusion Matrix 


750 

: 600 
450 
Et] 

= 

5 Le 

Normal Anomaly 
Predicted 


Figure 7-45b. Classification report and confusion matrix 


282 


CHAPTER 7 TEMPORAL CONVOLUTIONAL NETWORKS 


That’s a pretty good AUC score! However, this was an example of supervised 
anomaly detection, meaning you had the anomalies and the normal data labeled. You 
won t always have this luxury, and you shouldn’t expect it either because of the massive 
volumes of data that can be involved. For your next example, you will be implementing 
the encoder-decoder based temporal convolutional network (ED-TCN), but it will also 
be an instance of supervised anomaly detection so that it can be compared to the dilated 
TCN model given a similar task. However, keep in mind that since it is based on an 
autoencoder framework, the ED-TCN should also be able to perform semi-supervised 
anomaly detection. 


Encoder-Decoder Temporal Convolutional Network 


The version of the encoder-decoder TCN you will be exploring involves a combination 
of one-dimensional causal convolutional and pooling layers to encompass the encoding 
stage and a series of upsampling and one-dimensional causal convolutional layers to 
comprise the decoding stage. The convolutional layers in this model aren't dilated, but 
they still count as layers of a temporal convolutional network. To better understand the 
structure of this model, take a look at Figure 7-46. 


283 


CHAPTER 7 TEMPORAL CONVOLUTIONAL NETWORKS 


Output 


Decoding Soh max 
Stage 


Conv 2 
Upsample 2 


Conv 1 





Upsample 1 


Output of Encoding Stage 
Encoding 


Stage 










Input of Decoding Stage 


Pool 2 


Conv 2 


Pool 1 


Input 


Figure 7-46. In both the encoding and decoding stages, the model is comprised of 
causal convolutional layers and is structured so that the layers are always causal 


The diagram might seem pretty complicated, so let’s break it down layer by layer. 

First, look at the encoding stage and start with the input layer at the very bottom. 
From this layer, you perform a causal convolution on the input as part of the first 
convolutional layer. The outputs of the first convolutional layer, which you will call 
conv_1, are now the inputs of the first max pooling layer, which you will call pool 1. 

Recall from Chapter 3 that the pooling layer emphasizes the maximum value in 
the areas it passes through, effectively generalizing the inputs by choosing the heaviest 
values. From here, you have another set of causal convolutions and max pooling with 
layers conv_2 and pool 2. Note the progressive reduction in size of the data as it passes 
through the encoding stage, a feature characteristic to autoencoders. Finally, you have a 
dense layer in the middle of the two stages, representing the final, encoded output of the 
encoding stage as well as the encoded input of the decoding stage. 


284 


CHAPTER 7 TEMPORAL CONVOLUTIONAL NETWORKS 


The decoding stage is a bit different in this case, since you make use of what is called 
upsampling. Upsampling is a technique in which you repeat the data n number of times 
to scale it up by a factor n. In the max pooling layers, the data is reduced by a factor of 
two. So, to upsample and increase the data by the same factor of two, you repeat the 
data twice. In this case, you are using one-dimensional upsampling, so the layer repeats 
each step n times with respect to the axis of time. To get a better understanding of what 
upsampling does, let’s apply one-dimensional upsampling to Figure 7-47 and Figure 7-48. 


X= 4267169 
Figure 7-47. A vector x defined with the corresponding values 


n=2 
So data increases by factor of 2 / 
repeat each step two times 


Figure 7-48. The upsampling factor n 


Keeping in mind that each individual temporal step is repeated twice, you would see 
something like Figure 7-49, Figure 7-50, and Figure 7-51. 


Ee. | 
/ 


42671695— 


Figure 7-49. The first entry in the input is repeated twice to form the first two 
entries in the upsampled output vector 


vere ; 
la 


(42671695— 


Figure 7-50. The next entry is repeated twice to form the next two entries in the 
output vector of the upsampling operation 


285 


CHAPTER 7 TEMPORAL CONVOLUTIONAL NETWORKS 
4 422 6 6 ; 


(42671695— 


Figure 7-51. This process is repeated with the third entry in the input vector to 
form the next third pair of entries in the output vector 


And so on until you finally get Figure 7-52. 


(4422667711669955 


(42671695— 


Figure 7-52. The output vector after the upsampling operation compared to the 
original input vector below it 


Going back to the model, each upsampling layer is then connected to a one- 
dimensional convolutional layer, and the pair of upsampling layer and one-dimensional 
convolutional layer repeats again until the final output is passed through a softmax 
function to result in the output/prediction. 


Anomaly Detection with the ED-TON 


Let’s put this model to the test by applying it to the credit card dataset. Once again, this 
example is another instance of supervised learning, so you will have both anomalies 
and normal data labeled. 

First, begin by importing all of the necessary modules (see Figure 7-53). 


286 


CHAPTER 7 TEMPORAL CONVOLUTIONAL NETWORKS 


import numpy as np 

import pandas as pd 

import keras 

from keras import regularizers, optimizers 


from keras.layers import Inp ut, ConvlD, Dense, Flatten, Activation, 
UpSamplinglD, MaxPoolinglD, ZeroPadding1lD 


from keras.callbacks import ModelCheckpoint, TensorBoard 


from keras.models import Model, load model 


from keras.utils import to categorical 


from sklearn.model selection import train test split 


from sklearn.preprocessing.data import StandardScaler 





Figure 7-53. Importing the necessary modules 


Next, load your data and preprocess it. Notice that the steps are basically the same as 
in the first example (see Figure 7-54). 


287 


CHAPTER 7 TEMPORAL CONVOLUTIONAL NETWORKS 


dadf['Amount'] = 
standardscaler().~fit-transtorm (dt [*Amount"|] «values, reshape (=1 


df['Time'] = 
standardsScaler()«fit transtorm(df[ "Time" ].values.reshape (—1 


anomalies = df[df["Class"] == 1] 


normal = di [drt["Class*| == 0] 


for f in range(0, 20): 


normal = normal.iloc[np.random.permutation (len (normal) ) ] 


data_set = pd.concat([normal[:10000], anomalies] ) 


x train, x best = Liaein test Sspliri(data sec, Lest size = 0,4, 
random Stace = 42) 


x [train = x train.sort values (by=["Time’"]) 


x LESt = = tesl.sort values (by=["Time™] ) 


y.train = x traan(["Class™)] 


VY test = % test("Class”™ | 





Figure 7-54. Using the standard scaler on the columns Time and Amount, 
defining the anomaly and normal value data sets, and then defining a new data 
set to generate the training and testing sets from. Finally, these sets are sorted in 
increasing order of time 


And now you reshape the data sets as shown in Figure 7-55. 


x Crain = Nnp.array (x. train).reshape (x train. shape|0), 
x trainsshape[1j, 4) 


x test. = npwarreay (x test) .reshape(x testsshape!0), 
x tSet.shape(t)y. 1) 


input shape = (x _train.shape[1], 1) 


y tain = keras.utils.toe Categorical (y train, 2) 


y test = Kerasczutile.te categorically test, 2) 





Figure 7-55. Reshaping the training and testing sets so that they correspond with 
the input shape of the model 


288 


CHAPTER 7 TEMPORAL CONVOLUTIONAL NETWORKS 


Now that the data preprocessing is done, let’s build the model. This is the encoding 


stage (see Figure 7-56). 


input layer = Input (shape=(input_ shape )) 


### ENCODING STAGE 


# Pairs of causal 1D convolutional layers and pooling layers 
comprising the encoding stage 


conv... = ComviD(f1 lrers=inc input sheapel0)), kernel size=—2, 
dil@elon rare=ly 


padding='causal', strides=1,input_shape=input_shape, 
kernel. regulars zer—reculerizereslZ(0v0l yy 


ecuivalson—"relu*) (snpul ayer) 


pool l MaxPoolanglD(pool size=2, Strides=2) (conv. 1} 


Conv. 2. = ConviD (ra ters=2ne (input. shape] § 2) kernel size=Z, 
dilation rate=l1, 


padding='causal',strides=1, 
kernel regqularltzer=reqularizers.1Z(0201), 


aceivarion=' relu) (pool 1) 


pool 2 MaxPooling!D (pool -S176=2; SLrides=3):(conv. Z) 


conv 3 = ConviD(f1lrers=i1nc (input shape [0] aie kernel Si 7e=2; 
dilation rare=l, 


padding='causal', 
Stridee=1,;kernel regqulerizer—reqularazers.t2(0.01), 


aCLivation="relu™) (pool .2) 


### OUTPUT OF ENCODING STAGE 


encoder = Dense (int (input shape) ] Jf ©) 4 activation="relu") (conv 3) 





Figure 7-56. Defining the code for the encoding stage 


289 


CHAPTER 7 TEMPORAL CONVOLUTIONAL NETWORKS 


Following that block is the code for the decoding stage (see Figure 7-57). 


### DECODING STAGE 


# Pairs of upsampling and causal 1D convolutional layers comprising 
the decoding stage 


upsample 1 = UpSampling1D(size=3) (encoder) 


conv 4 = Conv1D(filters=int (input _shape[0]/3), kernel size=2, 
dilacvion: rate=1, 


padding='causal',strides=1, 
kernel requlearizer=neqularizers.ilZ2 (0.0L), 


activation='relu") (upsample 1) 
upsample 2 = UpSampling1D(size=2) (conv_4) 


Cony. 2: = Conv1D(filters=int (input _shape[0]/2), Kernel. SizeqZ,y 
dilation rate=1, 


padding='causal', 
Scrides=l,kernel ,eqularizer=reqularizersate (U.S), 


activation="relu”) (upsampile 2) 


zero pad 1 = ZeroPadding1D(padding=(0,1)) (conv 5) 


cony 6 =. ConvlD(tTilters=int (input shape [0)),. kernel size=2, 
dilacion rare=l, 


padding='causal', 
Sstrides=l,kernel teqularizer—reqularizers.,12(0.05), 


activation="relu") (zero pad 1) 


### Output of decoding stage flattened and passed through softmax to 
make predictions 


Plat = Flarren |) (oony.--6) 


output tayer = Dense (2, acuivation="Ssorttmax*) (tlat) 


TCN = Model (inputs=input layer, outputs=output layer) 





Figure 7-57. Code to define the decoding stage and then the final layer. The model 
is then initialized 


290 


CHAPTER 7 TEMPORAL CONVOLUTIONAL NETWORKS 


Now that the model has been defined, let’s compile it and train it (see Figure 7-58). 


TCN compile (lLoss=kéeras: Losses.categorical Ccrossentropy, 
optimizer=optimizers.Adam(lr=0.002), 


metrics=["accuracy"™] ) 


checkpointer = ModelCheckpoint (filepath="model ED- 


TCN erediuccard.lo”,; 


verbose=0O, 


save best. only=True) 


TCN.summary () 





Figure 7-58. Compiling the model, defining the checkpoint callback, and calling 
the summary function 


The output should look somewhat like Figure 7-59. 


291 


CHAPTER 7 TEMPORAL CONVOLUTIONAL NETWORKS 


Layer (type) Output Shape Param # 
input_18 (Inputhayer) -—=« (None, 31, 1)” . 
convid 101 (ConviD) (None, 31, 31) 93 

max poolingid 35 (MaxPooling (None, 15, 31) 0 
convid 102 (ConviD) (None, 15, 15) O45 

max poolingid 36 (MaxPooling (None, 5, 15) 0 
convid 103 (Conv1D) (None, 3S, 10) 310 
dense 32 (Dense) (None, 3, 3) she 

up samplingld 45 (UpSampling (None, 15, 5) 0 
convid 104 (ConvlD) (None, 15, 10) 110 

up samplingld 46 (UpSampling (None, 30, 10) 0 
convid 105 (ConviD) (None, 30, 15) 315 
zero paddingld 7 (ZeroPaddin (None, 31, 15) 0 
convid 106 (ConviD) (None, 31, 31) 961 
flatten 15 (Flatten) (None, 961) 0 
dense 33 (Dense) (None, 2) 1924 


Total params: 4,713 
Trainable params: 4,/13 
Non-trainable params: () 


Figure 7-59. The summary of the model. This can help you get an idea of how the 
encoding and decoding works by looking at the output shapes of each layer 


Notice the addition of the zero padding layer. What this layer does is add a 0 to the 


data sequence in order to help the dimensions match. Because the original data had an 


odd number of columns, the number of dimensions in the output of the decoder stage 


did not match the dimensions of the original data after being upsampled (this is because 
of rounding issues, since everything is an integer). To counter this, 


zero pad 1 = ZeroPadding1D(padding=(0,1))(conv_5) 


2O2, 


CHAPTER 7 TEMPORAL CONVOLUTIONAL NETWORKS 


was included, where the tuple is formatted as (left_pad, right_pad) to customize how 


the padding should be. Otherwise, passing in an integer will just pad on both ends. To 


summarize, zero padding will add a zero to each entry in the data to the left, right, or 
both (default) sides. 
With the model compiled, all that’s left for you to do is train the data (see Figure 7-60). 


TCN sELC(S Ciainy y Train, 


batch. size=123, 


epochs=25, 


verbose=l, 


Validation data=(x. test, y test), 


callbacks = 


[checkpointer] ) 


Figure 7-60. Training the data on the training sets 


After a while, you should end with something like Figure 7-61. 


Out (73): 


Epoch 7/25 
6295/6295 ee = eee } 
Epoch 8/25 


Epoch 9/25 

6295/6295 (===SS SSS SSS SS SSS SSS SSSSSSsSS==] 
Epoch 10/25 

6295/6295 [==============================) 
Epoch 11/25 


Epoch 13/25 
6295/6295 ( ===== === =$s$$sS==$=S5SS=======]) 
Epoch 14/25 
6295/6295 [==S=SSSSSSsSassSSsSSSsSSSSSSS==] 
Epoch 15/25 


Epoch 18/25 


6295/629S (==eeeeeeeeeeeeeereseeesseere== 


Epoch 19/25 


6295/6295 (SSSSSSSSssssssssssssssssssssa=] 
Epoch 20/25 
6295/6295 [H==HS= SSS S SSS H SSS S SSS SS SSSS====] 
Epoch 21/25 
6295/6295 [S=s=s=s=sssssSSSSSSSSSSSSSS======]) 


Epoch 22/25 


6295/6295 [=== EZESEESEEES SESS SSSSSSSSSSSS==) 


Epoch 23/25 


6295/6295 (Sttttsssssssssssssssssssssss==] 
Epoch 24/25 
6295/6295 [ SS SSS SSSS SSS SSSSSSSSSSSSSSS===] 
Epoch 25/25 
6295/6299 [===ssssssssssssssssssssssss==5] 


<keras.callbacks.History at 0x23907fc6ébe0> 


- Os 


- Os 


- Os 


- Os 


- is 


- Os 


- ils 


- lis 


- is 


- 0s 


- ls 


- is 


- Os 


- Os 


- is 


- is 


- Os 


loss: 


loss: 


loss: 


loss: 


loss: 


loss: 


loss: 


loss: 


loss: 


loss: 


loss: 


loss: 


loss: 


loss: 


loss: 


loss: 


loss: 


loss: 


loss: 


0. 


°o 


oO 


Qo 


o 


oO 


2 


=) 


oO 


°o 


o 


o 


oO 


Qo 


o 


°o 


Oo 


oO 


Oo 


0891 


0848 


0804 


0775 


.0749 


.0764 


0752 


0746 


.0733 


.0703 


0731 


-0688 


-0695 


0694 


0696 


.0723 


0687 


0665 


0657 


: 0.9889 


: 0.9895 


: 0.9900 


: 0.9908 


: 0.9911 


: 0.9913 


: 0.9903 


: 0.9911 


: 0.9916 


: 0.9906 


: 0.9906 


: 0.9909 


: 0.9916 


: 0.9905 


: 0.9909 


: 0.9898 


: 0.9909 


: 0.9913 


: 0.9916 


val_loss: 
val loss: 
val loss: 
val _ loss: 
val_loss: 
val loss: 
val loss: 
val_loss: 
val loss: 
val loss: 
val_loss: 
val_loss: 
val loss: 
val loss: 
val_loss: 
val _ loss: 
val loss: 
val_loss: 


val loss: 


2 


0909 


-0843 


-0827 


0806 


-0811 


0781 


0788 


0768 


-0826 


.0753 


-0741 


.0769 


.0754 


.0782 


0765 


.0739 


0710 


0703 


.0718 





val_acc: 
val_acc: 
val _ acc: 
val_acc: 
val_acc: 
val_acc: 
val _acc: 
val_acc: 
val_acc: 
val_ ace: 
val_acc: 
val_acc: 
val_acc: 
val _ acc: 
val_acc: 
val_acc: 
val _ acc: 
val_acc: 


val_acc: 


Figure 7-61. This output is similar to what you should see after the training 
process ends 


o 


o 


o 


Oo 


So 


o 


o 


Oo 


Oo 


o 


oO 


Oo 


Oo 


oO 


Oo 


Oo 


Oo 


Qo 


Oo 


«9907 


- 9902 


- 9907 


9907 


- 9905 


+9907 


- 9907 


+9907 


- 9905 


- 9907 


-9909 


- 9902 


-9909 


-9902 


- 9905 


+9905 


9912 


-99140. 


-9905 


293 


CHAPTER 7 TEMPORAL CONVOLUTIONAL NETWORKS 


Now evaluate your model’s performance (see Figure 7-62). 


score = TCN.evaluate(x test, y Test, verbose=0) 


print('Test loss:', score[0]) 


print('Test accuracy:', score[1]) 





Figure 7-62. Evaluates the model's performance in terms of loss and accuracy 


You should see an output similar to Figure 7-63. 


In [74]: 
score = TCN.evaluate(x test, y test, verbose=0) 
print('Test loss:"', score[0]) 
print ("Test accuracy:", score[i]) 


Test loss: 0.0717651191317761 
Test accuracy: 0.9904693828925423 


Figure 7-63. The generated outputs for loss and accuracy for the model when the 
test sets are passed in 


Pretty good, but how’s the AUC score? Run the code in Figure 7-64. 


from sklearn.metrics import roc auc score 


preds: = TCON.predicu(x best) 


auc -= TOC auc Score, tip.Fround (preds);, y rest) 


prance "AUC? t2«2eo}"<Llormac. (ac) ) 





Figure 7-64. Code to check the AUC score given the rounded predictions and the 
test sets 


The output should look somewhat like Figure 7-65. 


294 


CHAPTER 7 TEMPORAL CONVOLUTIONAL NETWORKS 


In [75]: from sklearn.metrics import roc auc score 


preds = TCN.predict(x test) 
auc = roc auc score( np.round(preds), y test) 
print("AUC: {:.2%}".format (auc) ) 


AUC: 98.64% 


Figure 7-65. The generated AUC score 


That’s a nice AUC score! So for both the encoder-decoder TCN and dilated TCN 
architectures, you’ve managed to attain AUC scores of over 98% on the credit card 
data set in a supervised setting. Although both models trained and performed in a 
supervised setting, since the anomalies and the normal entries were labeled as such, 
the key takeaway is that TCNs are incredibly quick to train with GPUs and can perform 


really well. 


Summary 


In this chapter, we discussed temporal convolutional networks and showed how they 
fare when applied to anomaly detection. 
In the next chapter, we will look at practical use case of anomaly detection. 


295 


CHAPTER 8 


Practical Use Cases 
of Anomaly Detection 


In this chapter, you will learn how anomaly detection can be used in several industry 

verticals. You will explore how anomaly detection techniques can be used to address 

practical use cases and address real-life problems in the business landscape. Every 

business and use case is different, so while we cannot copy-paste code to build a 

successful model to detect anomalies in any dataset, this chapter will cover many use 

cases to give an idea of the possibilities and concepts behind the thought processes. 
In a nutshell, the following topics will be covered throughout this chapter: 


e What is anomaly detection? 
e Real-world use cases of anomaly detection 
e Telecom 
e Banking 
e Environmental 
e Healthcare 
e Transportation 
e Social Media 
e Finance and Insurance 
e Cybersecurity 
e Video Surveillance 


e Manufacturing 


© Sridhar Alla, Suman Kalyan Adari 2019 
S. Alla and S. K. Adari, Beginning Anomaly Detection Using Python-Based Deep Learning, 
https://doi.org/10.1007/978-1-4842-5177-5_8 


204 


CHAPTER 8 PRACTICAL USE CASES OF ANOMALY DETECTION 


e Smart Homes 
e Retail 


e Implementation of deep learning-based anomaly detection 


Anomaly Detection 


Anomaly detection is finding patterns that do not adhere to what is considered as 
normal or expected behavior. Businesses can lose millions of dollars due to abnormal 
events. Consumers can also lose millions of dollars. In fact, there are many situations 
every day where people's lives are at risk and where their property is at risk. If your 
bank account gets cleaned out, that’s a problem. If your water line breaks, flooding 
your basement, that’s a problem. If all flights get delayed, that’s a problem. You might 
have been misdiagnosed or not diagnosed at all with a health issue, which is a very big 
problem that directly impacts your well-being. 

Figure 8-1 is an example of an anomaly showing a rainbow-colored fish in the 
blueish fish family. 


The Fish Family Portrait 


@. 
® 





Figure 8-1. An example of an anomaly 


298 


CHAPTER 8 PRACTICAL USE CASES OF ANOMALY DETECTION 


In business use cases, everything is centered around data, and anomaly detection is 
the identification of abnormal data points, events, or observations that raise suspicions 
due to the fact that they differ significantly from the data perceived as normal or typical. 
Many such anomalies can impact the business operations or bottom lines significantly, 
which is why anomaly detection is gaining a lot of traction in certain industries and 
many businesses are investing heavily in technologies that can help them identify 
abnormal behavior before it is too late. Such proactive anomaly detection is becoming 
more and more visible, and due to the new technologies developed as part of the Al 
revolution, this problem is also getting solved in ways never possible before. 

Figure 8-2 is an example of the daily number of cars that cross the Golden Gate 


Bridge in San Francisco. 


a 


Cars Crossing the Golden Gate Bridge 





Jan Feb Mar Apr May Jun Jul 


Figure 8-2. Daily count of cars crossing 


The kind of anomaly detection that can potentially help businesses depends very 
much on the kind of data collected as part of the business operations and the kind of 
techniques and algorithms used as part of the strategy to perform the anomaly detection. 


Real-World Use Cases of Anomaly Detection 


We will look at several industry verticals and businesses, and how anomaly detection can 


be used. 


209 


CHAPTER 8 PRACTICAL USE CASES OF ANOMALY DETECTION 


Telecom 


In the telecom sector, some of the use cases for anomaly detection are to detect roaming 
abuse, revenue fraud, and service disruptions. So how do we detect roaming abuse in the 
telecom sector? By looking at the location of the cellular devices, we can categorize the 
kind of behavior of the cellular device at any particular moment as normal or abnormal. 
This helps us detect cellular device usage at that period of time. By looking at all of 

the other information we know in general about roaming activity, we can also detect 
how this cellular device is being used and whether any roaming abuse is taking place. 
Figure 8-3 shows how roaming works for your phone as you travel around the world. 





Figure 8-3. Roaming 


Service disruption is another very high impact use case for anomaly detection. 
Cellular devices are connected to cellular networks via towers, which are all over the 
place. Your cell phone connects to the nearest tower in order to participate in the cellular 
network. In case of events involving large crowds such as a concert or a football game, 
the cellular towers that typically perform quite well get heavily overloaded, causing 
serious service disruptions and very bad customer experience for the duration of the 
overload. Figure 8-4 shows a service disruption of phone service in the northwestern 
United States. 


300 


CHAPTER 8 PRACTICAL USE CASES OF ANOMALY DETECTION 





Figure 8-4. Service disruptions 


If we know the various metrics of the cell phone towers and the associated devices 
at some period of time and for a long duration, along with any kind of information 
we have on the typical nature of activity around the towers in terms of whether there 
were concerts or games in the vicinity or a major event is expected in the vicinity of 
the cellular towers, we can use a time series as a basis to represent all such activity and 
subsequently use TCN or LSTM algorithms to detect anomalies pertaining to the major 
events because they have a temporal dependency. This will help in looking at how 
these services are being used and how effective the service is for the particular cell 
phone towers. 

The cell phone companies now have a way of understanding whether certain hours 
need to be upgraded or more towers need to be built. For instance, if major office 
buildings are being built near a particular tower, using data on the time series of all the 
towers owned by the cellular network, it is possible to detect anomalies in other parts of 
the network and apply the principles to the tower that is probably going to be impacted 
by the newly constructed office buildings (which will add thousands of cell phone 
connections and could cause overloading on the tower and affect how the tower will be 
used in the near future). 


301 


CHAPTER 8 PRACTICAL USE CASES OF ANOMALY DETECTION 


Banking 


In the banking sector, some of the use cases for anomaly detection are to flag abnormally 
high transactions, fraudulent activity, phishing attacks, etc. Credit cards are used by 
almost everyone in the world, and typically every individual has a certain way of using 
their credit card, which is different from everyone else. So there is an implicit profile of 
the individual using the credit card in terms of how they use it, when they use it, why 
they use it, and what did they use it for. If the credit card company has such information 
about the credit card usage of very large number of consumers, it is possible to use 
anomaly detection to detect when a specific credit card transaction may be fraudulent. 

Autoencoders are very useful in such an anomaly detection use case. With such a 
case, we can take all the credit card transactions by individual consumers, and capture 
and convert the features into numerical features such that we can assign certain scores 
to every credit card based on various factors along with a kind of indicator as to whether 
the transaction are normal or abnormal. Then, using autoencoders, we can build an 
anomaly detection model that can quickly determine a specific transaction as normal or 
abnormal given everything we know about all the other transactions for a customer. The 
autoencoder does not even need to be extremely complicated; it can be built with just a 
few hidden layers for the encoder and a few hidden layers for the decoder and still have 
pretty decent detection of abnormal activity (otherwise known as fraudulent activity) on 
the credit cards. Figure 8-5 is a depiction of credit card fraud. 





Figure 8-5. Depiction of credit card fraud 


302 


CHAPTER 8 PRACTICAL USE CASES OF ANOMALY DETECTION 


Environmental 


When it comes to environmental aspects, anomaly detection has several applicable 
use cases. Whether it is deforestation or melting of glaciers, air quality or water quality, 
anomaly detection can help in identifying abnormal activities. Figure 8-6 is a photo of 
deforestation. 





Figure 8-6. Deforestation 
Source: commons. Wwikimedia.org 


Let’s look at an example of the air quality index. The air quality index provides 
some kind of measurement of breathable air quality, which can be measured by using 
various sensors placed at various locations in the region. These sensors measure and 
send periodic data to be collected by a centralized system where such data is collected 
from all of the sensors. This becomes a time series, with each measurement consisting 
of several attributes or features. With each point in time having a certain number of 
features, which can then be input into a neural network such as an autoencoder, we can 
build an anomaly detector. Of course, we can use a LSTM or even TCN to do the same. 
Figure 8-7 shows the air quality index in Seoul in 2015. 


303 


CHAPTER 8 PRACTICAL USE CASES OF ANOMALY DETECTION 


2015 PM2.5 Air Quality Index in Seoul 


250 
200 
150 
100 


50 


—_ -* —— ey See get eee +— = = 
EPEuE Teale fs SoSoe eee ie 
ss tii. F. eas Pett oe: — to te F, 


Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec 





Figure 8-7. Air quality index 
Source: commons. Wwikimedia.org 


Healthcare 


Healthcare is one of the domains that can benefit a lot from anomaly detection, whether 
it is to prevent fraud, detect cancer or chronic illness, improve ambulatory services, etc. 

One of the biggest use cases for anomaly detection in healthcare is to detect 
cancer from various diagnostic reports even before there are any significant symptoms 
that might indicate the presence of cancer. This is extremely important given the 
serious consequences of cancer for any person. Some of the techniques in anomaly 
detection that we can use here involve convolutional neural networks combined with 
autoencoders. 

Convolutional neural networks use the concept of dimensionality reduction to 
reduce the large number of features/pixels with colors into much lower dimensionality 
points using the neural networks layers. So, if we combine this convolutional neural 
network with autoencoders, we can also use autoencoders to look at images such as MRI 
images, mammograms, or other images from diagnostic technologies in the healthcare 


industry. Figure 8-8 is a set of images from a CT scan. 


304 


CHAPTER 8 PRACTICAL USE CASES OF ANOMALY DETECTION 





Figure 8-8. CT scan images 
Source: commons. Wikimedia.org 


Let’s look at another use case of detecting abnormal health conditions of 
residents of a particular neighborhood. Typically, local hospitals are used by residents 
of specific neighborhoods. Using such data, the hospital can collect and store various 
kinds of health metrics from all the residents in this neighborhood. Some of the 
possible metrics are blood test results, lipid profiles, glycemic values, blood pressure, 
ECG, etc. When combined with demographic data such as age, sex, health conditions, 
etc., this information potentially allows us to build a sophisticated Al-based anomaly 
detection model. 

Figure 8-9 shows different health issues observed by looking at ECG results. 


305 


CHAPTER 8 PRACTICAL USE CASES OF ANOMALY DETECTION 


Note how half of the P waves 
are not followed by the ORS 
complex and T waves while the 
other half are. 

Question: What would you 
expect to happen to heart rate 
(pulse)? 


Second-degree (partial) block 


Note the abnormal electrical 
pattern prior to the ORS 


complexes. Also note how the 
frequency between the QRS 
complexes has increased. 
Question: What would you 
expect to happen to heart rate 


ulse)? 
Atrial fibrillation (pulse) 


Note the unusual shape of the 
ORS complex, focusing on the 
“S" component. 

Question: What would you 
expect to happen to heart rate 
(pulse)? 


Ventricular tachycardia 


Note the total lack of normal 
electrical activity. 
Question: What would you 


expect to happen to heart 
rate (pulse)? 


Ventricular fibrillation 


Note that in a third-degree block 
some of the impulses initiated by 


the SA node do not reach the 

AV node while others do. Also note 
that the P waves are not followed 
by the QRS complex. 

Question: What would you expect 


Third-degree block to happen to heart rate (pulse)? 


Figure 8-9. ECG results 
Source: commons.Wwikimedia.org 


There are a lot of different use cases in healthcare where we can use different 


anomaly detection algorithms to implement preventative measures. 


Transportation 


In the transportation sector, anomaly detection can be used to ensure proper 
functioning of the roadways and vehicles. If we can collect different types of events from 
all the sensors that are operational on the roadways such as toll booths, traffic lights, 
security cameras, and GPS signals, we can build an anomaly detection engine that we 
can then use to detect abnormal traffic patterns. 

Anomaly detection can also be used to look at times in schedules of public 
transportation and the related traffic conditions in the similar area of transportation. 


306 


CHAPTER 8 PRACTICAL USE CASES OF ANOMALY DETECTION 


We can also look for abnormal activity in terms of fuel consumption, number of 
passengers the public transportation is supporting, seasonal trends, etc. Figure 8-10 is an 
image of a traffic jam due to peak time unexpected traffic. 





Figure 8-10. Traffic jam 


Social Media 


In social media platforms such as Twitter, Facebook, and Instagram, anomaly detection 
can be used to detect hacked accounts spamming everyone, false advertisements, fake 
reviews, etc. Social media platforms are used extensively by billions of people, so the 
amount of activity on social media platforms is extremely high and is ever growing. In 
order to ensure the privacy of the individuals using the social media platforms as well 
as to ensure the proper experience for each and every individual using the social media 
platforms, there are many techniques that can be used to enhance the capabilities of this 
system. Using anomaly detection, every individual activity can be examined for normal 
and abnormal behavior. 

Similarly, any advertising platforms ads, any personalized friend recommendations, 


any news articles that the individual might have been interested in, such as elections, 


307 


CHAPTER 8 PRACTICAL USE CASES OF ANOMALY DETECTION 


can be processed for abnormal or anomalous activity. It would be a great use case 

for anomaly detection if anomaly detection could detect troll activity on your tweets, 
propagandized bots, fake news, and so on. Anomaly detection can also be used to 
detect if your account has been taken over, because all of a sudden your account might 
be posting an immense amount of tweets, pause tweets, and comments, or might be 
trolling other accounts and spamming everyone else. Figure 8-11 shows an article on 
fake news on Facebook. 


Facebook Admits Fake News Algorithm 


JOnn MOS | @JohiMannes 


mn 
JGBOREeEOW 








In the aftermath of the U.S. Presidential election, Facebook founder Mark Zuckerberg 
admits that a joke at Facebook “may have gotten out of hand” during the election, as what 
he terms a “playful algorithm experiment in the back office" generated dozens of fake news 
in millions of feeds. “Are we sorry?” Zuckerberg asked, but only rhetorically. “Sort of." 


éuckerberg insisted that anyone who read their news only through their Facebook feeds is 
“not playing with a full deck” although he added that he still hopes they don't mind being 
bombarded with targeted advertising. There a fine line between news and advertising, he 


Figure 8-11. Fake news on Facebook 


Finance and Insurance 


In the finance and insurance industries, anomaly detection can be used to detect 
fraudulent claims, fraudulent transactions such as transfer of money in and out of the 
country, fraudulent travel expenses, and the risk associated with the specific policy or 
individual, etc. The finance and insurance industries depend on the ability to target 
the right consumers and take the right amount of risk when dealing with finance and 


308 


CHAPTER 8 PRACTICAL USE CASES OF ANOMALY DETECTION 


insurance. For instance, if they already know that a specific area is prone to forest fires 
or earthquakes or very frequent flooding, the insurance company insuring your home 
needs to have all the tools that they can get their hands on to quantify the amount of risk 
involved when writing the policy for homeowner insurance. 

Anomaly detection can also be used to detect wire fraud where a large amount 
of money is transferred in and out of the country using several different accounts, 
something extremely difficult for human eyes to manually glance over and figure out 
considering the massive volume of transactions that can take place every hour. This is 
feasible because AI techniques can be trained on very large amounts of data to detect 
very new and innovative wire fraud beyond the capabilities of any human or many of the 
statistical techniques that have been in place for decades. Deep learning does solve a very 
big problem in the financial and insurance industries, and with the advent of graphical 
processing units (GPUs), this is becoming a reality in many of the very hard-to-crack use 
cases. Anomaly detection and deep learning can be used together in order to serve the 
needs of the business. Figure 8-12 shows the mortgage loan fraud reporting trend. 


Mortgage Loan Fraud Reporting Trend 


30,000 
25,000 
20,000 + 
15,000 
10,000 4 
5,000 4 





1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 
@ Actual O Projected 


Figure 8-12. Mortgage loan fraud reporting trend 


Cybersecurity 


Another use case for anomaly detection is in cybersecurity or networking. In fact, one of 
the very first use cases for anomaly detection was decades ago when just the statistical 
models were being used to try to detect any intrusion attempts into networks. In the 
cybersecurity space, there are many things that can happen. One of the most prevalent 


309 


CHAPTER 8 PRACTICAL USE CASES OF ANOMALY DETECTION 


attacks is a denial of service (DOS) attack. When a denial of service attack is launched 
against your company’s website or portal so as to disrupt service to your customers, 
typically a large number of machines are mobilized to run simultaneous connections and 
random useless transactions against your portal (which is probably dealing with some 
kind of a payment service for customers). As a result, the portal isn’t responsive to the 
customers, eventually leading to very poor customer experience and a loss of business. 
Anomaly detection can detect the anomalous activity since we're training the system 
on data that has been collected for a long period of time. This data is comprised of 
typical use behavior, patterns in payment, how many users are active, and how much 
the payment is at this particular time, as well as seasonal behaviors and other trends 
that exist for the payment portal. When a DOS attack is suddenly launched against your 
payment portal, it is very possible for your anomaly detection algorithm to detect such 
activity and quickly notify the infrastructure or operational teams who can take corrective 
action such as setting up different firewall rules or better routing rules that attempt to 
block the anomalous or bad actors from launching the attack or prolonging the attack 


against the portal. Figure 8-13 is example of anomaly monitoring network flows. 





sie 12-00 re a ak a a Le 
15. Sep 2°00 16. Sep 12:00 17. Sep 12°00 


Figure 8-13. Anomaly monitoring network flows 


Another example is when hackers try to get into a system given that they were 
somehow able to set up a Trojan to get into the network in the first place. Typically, this 
process involves a lot of scanning, such as port or IP scanning, to see what machines exist 
in the network when the services are being run. The machines may be running SSH and 


telnet (which is easier to crack), and the hacker may try to launch several different types 


310 


CHAPTER 8 PRACTICAL USE CASES OF ANOMALY DETECTION 


of attacks that exploit the vulnerabilities of the telnet or asset service. Eventually, one of 
the targeted machines will respond, and the hacker will get into the system and continue 
the penetration of the internal network until they accomplish what they came for. 

Typically, networks have a pattern of usage, and there are database servers, web 
servers, development servers, payroll systems, QA systems, and end user-facing systems. 
Usually the well-known, expected behavior is seen for a long period of time. Then 
there is a change that is observed and expected over a long period of time as to how the 
machines are used as well as how the networks are used. We can also measure the ways 
machines talk to each other and via which service/ports. 

Using anomaly detection, we can detect if a specific port or service on a specific 
machine or machines is being connected to or transacted with at an abnormal rate, 
meaning that there is some kind of intrusion activity taking place where some intruder is 
trying to hack into the specific system or systems. This is extremely valuable information 
to the operations team, who can quickly pull in the cybersecurity experts and try to drill 
down into what is really going on and take any kind of preventive or proactive action 
rather than reactivate. This could be the difference between the business staying afloat 
or the business shutting down (at least temporarily). There have been instances where 
a single cyber security intrusion almost bankrupted a business, costing hundreds of 
millions of dollars in damages. This is the reason why the cybersecurity domain is 
very interested in deep learning, and the use cases that involve deep learning anomaly 
detection are some of the top use cases in the cyber security and networking space in 
this day and age. Figure 8-14 shows an anomaly in the number of TCP connections on 
different service ports. 





0.00% 


x = Z " ith. 
2008 2006 2007 2008 #2009 #2010 2011 2012 2013 2014 #2015 2016 2017 2018 2019 


Figure 8-14. TCP connections over service ports 


311 


CHAPTER 8 PRACTICAL USE CASES OF ANOMALY DETECTION 


Not all the use cases are doom and gloom in cyber security or networking; anomaly 
detection can also be involved in determining whether we need to upgrade some of the 
systems, whether our systems are able to sustain the traffic for now and in the future, 
whether any node capacity planning needs to take place to bring everything back 
to normal, and so on. This is again very important for the operations team so it can 
understand if there are trends which were not foreseen a year ago that are now affecting 
the normal to abnormal behavior of the network. It is very important to know right now 
rather than later when it is too late and to start proactively planning to deal with this 
origin traffic or transactions that are happening in our network against some specific 


machine or machines. 


Video Surveillance 


Another domain where anomaly detection is becoming extremely important is 
video surveillance. Nowadays, it is very common to see security cameras and video 
surveillance systems no matter where you go: a local school, a local park, Main Street, 
near a neighbor’s house, or in your own house. The point is, video surveillance is here 
to stay. Given all the new technological advancements in smart apps and smartphones, 
this is definitely not going to change any time soon. Rather, we should expect much 
more video surveillance. In the very near future, we will see lot more smart cars and 
self-driving cars. They also depend on continuous processing of video using real-time 
analysis and detecting various objects. At the same time, they can also detect any kind 
of anomaly. In a strictly security video surveillance sense, anomaly detection can be 
used to detect the normal for the specific camera that is looking at your backyard. When 
a specific anomaly is detected because of some kind of motion within the vicinity of 
your house, such as a wild animal or even an intruder walking on your lawn, your home 
security system is able to see that this is not normal. In order for the cameras to do this 
effectively, the manufacturers train very sophisticated machine learning models to 
assess the video signals in real time. The feed coming from the cameras is determined 
as normal or abnormal. For example, if you are driving in a self-driving car on the 
interstate, video of the car will clearly indicate what is normal right now according to 
how the road should look, where the signs should be, where the trees should be, and 
where the next car should be. Using anomaly detection, self-driving cars can avoid any 
abnormalities happening on the path and then take corrective action before anything 
bad can happen. 

Figure 8-15 is an object-detecting video surveillance system. 


312 


CHAPTER 8 PRACTICAL USE CASES OF ANOMALY DETECTION 





Figure 8-15. Object-detecting video surveillance system 


Manufacturing 


Anomaly detection is also being used heavily in the manufacturing sector. Specifically, 
since most of the manufacturing nowadays involves robots and a lot of automation, 
anomaly detection can be used to detect malfunctions or impending failure of parts of 
the manufacturing system. 

In the manufacturing industry, because of all the automation that is happening, 
there is a lot of emphasis on various kinds of sensors and other types of metrics 
being collected in a real-time or near real-time basis. This data can be used to build 
a sophisticated anomaly detection model to try to detect if there is any impending 
problem that will be seen very soon in the plant or the manufacturing cycle. 

Another example of anomaly detection and how it can be used in business is the case 
of oil and natural gas platforms. An oil and natural gas platform typically has thousands 
of components all interconnected in various ways in order to make the plant functional. 
Needless to say, all the components can be monitored using sensors that do specific 
measurements of the various parameters of the components to which the sensors are 
attached to. All these sensors can be part of an IoT (Internet of Things) platform. If you 


313 


CHAPTER 8 PRACTICAL USE CASES OF ANOMALY DETECTION 


can collect all the sensor output from the tens of thousands of sensors attached to the 
tens of thousands of components, then it becomes possible for us to collect such data 
for a longer period of time and train sophisticated anomaly detection models such as 
autoencoders, LSTMs, and TCNs. 

Figure 8-16 shows a manufacturing plant with sensor readings. 





i 
i 
' 
' 
' 
i 
Peseantieeceerpasssccecceass : 
j 
i 
i] 
' 
i) 
I 


spectral power 


] ' 
ee ee 
eS 1 

i 

i 

1 

1 

’ 

1 

* 

4 

+ 

. 

a 

’ 
=-----|-------}------4--- 


t 

] 

fi 

i 

i 

t 

t 
= 

i 

i] 

t 

t 

t 

i 

i 
ate 

' 

t 

a 

t 

1 
es 





; 350 {00 450 500 550 600 650 FOO 
Wavelength (nm) 


Figure 8-16. Manufacturing plant with sensor readings 


314 


CHAPTER 8 PRACTICAL USE CASES OF ANOMALY DETECTION 


Smart Home 


Another kind of business that is also using anomaly detection to its advantage is the 
smart home system. Smart homes have lots of integrated components, such as smart 
thermostats, refrigerators, and interconnected devices, that all talk to each other. 
Let’s say you have an Amazon Alexa. Alexa can talk to your smart lights, which use 
smart bulbs. All components can use a very smart app on your smart phone. Even 
thermostats are interconnected. So how do we really use anomaly detection in this 
use case? A simple way is to monitor how you set your thermostat for the optimal 
temperature during all weather conditions and follow some sort of recommendation 
or recommended behavior. Because the thermostats are personalized to some extent 
in each household, there may be a very good deep learning algorithm out there that is 
continuously looking for the thermostats across all houses, including yours, and can 
then detect how you use it normally. Figure 8-17 is an illustration of a smart home. 





Smart Home 


» ® 
®D ——. 








Figure 8-17. A smart home 


Retail 


Another big industry that uses anomaly detection algorithms is the retail industry. In 
the retail industry, there are certain use cases such as the efficiency of the supply chain 
in terms of distribution of goods and services. Also interesting are the returns from 
customers because returned goods are tricky: sometimes it costs less to sell them in a 
clearance sale than to restock. 


315 


CHAPTER 8 PRACTICAL USE CASES OF ANOMALY DETECTION 


Looking at customer sales is also critical both in terms of revenue generated by 
sales and in terms of planning future products and sales strategies, especially when it 
comes to targeting the consumers better. Figure 8-18 shows the historical sales figures 
of a product. 


6000 
5500 
5000 
4500 
4000 


3500 





3000 
2010 2020 


Figure 8-18. Historical sales figures of a product 


Implementation of Deep Learning-Based Anomaly 
Detection 


Given these use cases in these different industries, what are the key steps in establishing 
an anomaly detection practice in your organization or business? 


The key steps involved in anomaly detection are as follows: 
e Identifying business use case and getting aligned on the expectations 


e Defining what data is available and understanding it and the nature 
of the data itself 


e Establishing the processes to consume the data in order to process it 
e Establishing the type of models to use 


e A strategic discussion of how the models will be used and executed 


316 


CHAPTER 8 PRACTICAL USE CASES OF ANOMALY DETECTION 


e Investigating the results and feedback analysis as it effects the 


business 


e Operationalizing the model used in the day-to-day activity of the 


business 


In particular, we are very interested in how the models are built and in what type 
of models we should be using. The type of anomaly detection algorithm used affects 
pretty much everything that we are trying to get out of this anomaly detection strategy. 
This in turn depends on the type of data available, as well as whether the data is already 
labeled or identified. One of the things that will affect the decision to figure out what 
type of anomaly detection will work best for the specific use case is whether it is a 
point anomaly, contextual anomaly, or a collective anomaly. We are also interested in 
looking at whether the data is an instantaneous snapshot at some point in time or if it 
is continuously evolving or ever-changing, real-time, time series data. Also important 
is whether the specific features or attributes of the data are categorical or numerical, 
nominal, ordinal, binary, discrete, or continuous. It is also very important to know if the 
data is being labeled already or if some sort of a hint is provided as to what this data is, 
since it could steer us in the direction of supervised, semi-supervised, or unsupervised 
algorithms. 

While the technologies and algorithms are available to be used, there are several key 
challenges to implementing an anomaly detection approach based on deep learning: 


e It’s hard to integrate AI into existing processes and systems. 
e The technologies and the expertise needed are expensive. 
e Leadership needs to be educated on what AI can and cannot do. 


e Al algorithms are not natively intelligent; rather, they learn by 
analyzing “good” data. 


e There is aneed for change in “culture,” especially in large companies. 


Summary 


In this chapter, we discussed practical use cases of anomaly detection in the business 
landscape. We showed how anomaly detection can be used to address real-life problems 
in many businesses. Every business and use case is different, so while we cannot copy/ 


317 


CHAPTER 8 PRACTICAL USE CASES OF ANOMALY DETECTION 


paste code to build a successful model to detect anomalies in any dataset, this chapter 
covered many use cases to give you an idea of the possibilities and concepts behind the 
thought process. 

Remember that this is an evolving field with continuous inventions and 
enhancements to the algorithms present, which means that in the future the 
algorithms will not look the same. Just couple of years ago, the RNN (recurrent neural 
network) was the best algorithm for a time series, but now the LSTM (Chapter 6) is 
being used heavily and the TCN (Chapter 7) will be the future of dealing with a time 
series. Even autoencoders have changed quite a bit; the traditional autoencoders have 
evolved into variational autoencoders (Chapter 4). The RBM (Chapter 5) is not used 
that much any longer. 

In the next chapter, Appendix A, we will look at Keras, which is a popular framework 


for deep learning. 


318 


APPENDIX A 


Intro to Keras 


In this appendix, you will be introduced to the Keras framework along with the 

functionality that it offers. You will also take a look at using the back end, which is 

TensorFlow in this case, to perform low-level operations all using Keras. 
Regarding the setup, we use 


e tensorflow-gpu version 1.10.0 

e keras version 2.0.8 

e torch version 0.4.1 (this is PyTorch) 
e CUDA version 9.0.176 


e cuDNN version 7.3.0.29 


What Is Keras? 


Keras is a high-level, deep learning library for Python, running with TensorFlow, CNTK, 
or Theanos as the back end. The back end can basically be thought of as the “engine” 
that does all of the work, and Keras is the rest of the car, including the software that 
interfaces with the engine. 

In other words, Keras being high-level means that it abstracts away much of the 
intricacies of a framework like TensorFlow. You only need to write a few lines of code 
to have a deep learning model ready to train and ready to use. In contrast, TensorFlow 
being more of a low-level framework means you have much more added syntax and 
functionality to define the extra work that Keras abstracts away for you. At the same 
time, TensorFlow and PyTorch also allow for much more flexibility if you know what 
you re doing. 


319 
© Sridhar Alla, Suman Kalyan Adari 2019 


S. Alla and S. K. Adari, Beginning Anomaly Detection Using Python-Based Deep Learning, 
https://doi.org/10.1007/978-1-4842-5177-5 


APPENDIXA INTRO TO KERAS 


TensorFlow and PyTorch allow you to manipulate individual tensors (similar to 
matrices, but they aren’t limited to two dimensions; they can range from vectors to 
matrices to n-dimensional objects) to create custom neural network layers, and to create 
new neural network architectures that include custom layers. 

With that being said, Keras allows you to do the same things as TensorFlow and 
PyTorch do, but you will have to import the back end itself (which in this case is 
TensorFlow) to perform any of the low-level operations. This is basically the same thing 
as working with TensorFlow itself since you’re using the TensorFlow syntax through 
Keras, so you still need to be knowledgeable about TensorFlow syntax and functionality. 

In the end, if you’re not doing research work that requires you to create a new type 
of model, or to manipulate the tensors directly, simply use Keras. It’s a much easier 
framework to use for beginners, and it will go a long way until you become sufficiently 
advanced enough that you need the low-level functionality that TensorFlow or PyTorch 
offers. And even then, you can still use TensorFlow (or whatever back end you're using) 
through Keras if you need to do any low-level work. One thing to note is that Keras has 
actually been integrated into TensorFlow, so you can access Keras through TensorFlow 
itself, but for the purpose of this appendix, we will use the Keras API to showcase the 
Keras functionality, and the TensorFlow back end through Keras to demonstrate the low- 
level operations that are analogous to PyTorch. 


Using Keras 


When using Keras, you will most likely import the necessary packages, load the data, 
process it, and then pass it into the model. In this section, we will cover model creation 
in Keras, the different layers available, several submodules of Keras, and how to use the 
back end to perform tensor operations. 

If you'd like to learn Keras even more in depth, feel free to check out the official 
documentation. We only cover the basic essentials that you need to know about Keras, so 
if you have further questions or would like to learn more, we recommend you to explore 
the documentation. 

For details on implementation, Keras is available on GitHub at https: //github. 
com/keras-team/keras/tree/c2e36f369b411ad1d0a40ac096fe35f73b9dFfd3. 

The official documentation is available at https: //keras.io/. 


320 


APPENDIXA INTRO TO KERAS 


Model Creation 


In Keras, you can build a sequential model, or a functional model. 
The sequential model is built as shown in Figure A-1. 


In [2]: l ### Sequential model 


import keras 
from keras.models import Sequential 
from keras.layers import Dense 


seq model = Sequential () 

seq model.add(Dense(16, input _shape=(8,))) 

seq model.add(Dense(32, activation="relu')) 
seq model.add(Dense(16, activation='"softmax')) 


Figure A-1. Code defining a sequential model in Keras 


Once you've defined a sequential model, you can simply add layers to it by calling 
model name.add(), where the layer itself is the parameter. Once you’ve finished adding 
all of the layers that you want, you are ready to compile and train the model on whatever 
data you have. 

Now, let’s look at the functional model, the format of which is what you've used in 
the book thus far (see Figure A-2). 


In [5]: l  ### Functional model 
3 import keras 
4 from keras.models import Model 
from keras.layers import Input, Dense 
input layer = Input (shape=(6,)) 
dense 1 = Dense (32, activation='relu') (input_layer) 


output layer = Dense(i6, activation="softmax") (dense 1) 


func model = Model(input_layer, output_layer) 
Figure A-2. Code defining a functional model in Keras 


The functional model allows you to have more flexibility in how you define your 
neural network. With it, you can connect layers to any other layer that you want, instead 
of being limited to just the previous layer like in the sequential model. This allows you 
to share a layer with multiple other layers or even reuse the same layer, allowing you to 


create more complicated models. 


321 


APPENDIXA INTRO TO KERAS 


Once you're done defining all of you layers, you simply need to call Model() with 
your input and output parameters respectively to finish your whole model. Now, you can 


continue onwards to compiling and training your model. 


Model Compilation and Training 


In most cases, the code to compile your model will look something like Figure A-3. 


In [. i: model.compile (optimizer="", 


=r rt 


loss="", 
metrics—="") 


Figure A-3. Code to compile a model in Keras 


However, there are many more parameters to consider: 


e optimizer: Passes in the name of the optimizer in the string or an 
instance of the optimizer (you call the optimizer with whatever 
parameters you like. We will elaborate on this further below in the 


Optimizers section. ) 


e loss: Passes in the name of the loss function or the function itself. 
We elaborate on what we mean by this below in the Losses section. 


e metrics: Passes in the list of metrics that you want the model to 
evaluate during the training and testing processes. Check out the 


Metrics section for more details on what metrics you can use. 


e loss_weights: If you have multiple outputs and multiple losses, the 
model evaluates based on the total loss. The loss_weights are a list 
or dictionary that determines how much each loss factors into the 
overall, combined loss. With the new weights, the overall loss is now 
the weighted sum of all losses. 


e sample_weight_mode: If your data has 2D weights with timestep- 
wise sample weighting, then you should pass in “temporal”. 
Otherwise, None defaults to 1D sample-wise weights. You can also 
pass a list or dictionary of sample_weight_modes if your model has 
multiple outputs. One thing to note is that you need at least a 3D 


output, with one dimension being time. 


322 


APPENDIXA INTRO TO KERAS 


e weighted_metrics: A list of metrics for the model to evaluate and 
weight using sample_weight or class_weight during the training and 


testing Processes. 


After compiling the model, you can also call a function to save your model as in 


Figure A-4. 


im tf ts from keras.callbacks import ModelCheckpoint 
checkpointer = ModelCheckpoint (filepath="saved model.h5", 


verbose=-0, 
save_best_only=True) 


Figure A-4. A callback to save the model to some file path 


Here are the set of parameters associated with ModelCheckpoint(): 


e filepath: The path where you want to save the model file. Typing just 
“saved_model.h5” saves it in the same directory. 


e monitor: The quantity that you want the model to monitor. By 


default, it’s set to “val_loss” 
e verbose: Sets verbosity to 0 or 1. It’s set to 0 by default. 


e save_best_only: If set to True, then the model with the best 
performance according to the quantity monitored will be saved. 


e save_weights_only: If set to True, then only the weights will be 
saved. Essentially, if True, model.save_weights(filepath), else model. 


save(filepath). 


e mode: Can choose between auto, min, or max. If save best only is 
True, then you should pick a choice that would suit the monitored 
quantity best. If you chose val_acc for monitor, then you want to pick 
max for mode, and if you choose val_loss for monitor, pick min for 


mode. 
e period: How many epochs there are between each checkpoint. 


Now, you can train your model using code similar to Figure A-5. 


323 


APPENDIXA INTRO TO KERAS 


In [ ]: 1 model.fit(x, y, 
Z batch size=126, 
epochs=25, 


verbose=1, 
validation data=(x t, y t), 
callbacks = [checkpointer] ) 


Figure A-5. Code to train the model 


The model. fit() function has a big list of parameters: 


e x: This isa Numpy array representing the training data. If you have 
multiple inputs, then this is a list of Numpy arrays that are all training 
data. 


e y: This isa Numpy array that represents the target or label data. 
Again, if you have multiple outputs, then this is a list of target data 
Numpy arrays. 


e batch_size: Set to 32 by default. This is the integer number of 
samples to run through the network before updating the gradients. 


e epochs: An integer value dictating how many iterations for the entire 
x and y data to pass through the network. 


e verbose: 0 makes it train without outputting anything, 1 shows a 
progress bar and the metrics, and 2 shows one line per epoch. Check 
the figures below for exactly what each value does: 


Verbosity 1 (Figure A-6) 


In [16]: TCN.£it(x_train, y train, 
: batch size=128, 
epochs=25, 
verbose=1, 
validation data=(x test, y test), 
callbacks = [(checkpointer]) 


Train on 6295 samples, validate on 4197 samples 





Epoch 1/25 

6298/6298 (==sSsssssssssssssssssssass====) - Os - loss: 0.0620 - acc: 0.9911 - val loss: 0.0641 - val_ace: 0.9900 
Epoch 2/25 

6295/6295 [s=sssssssssssssssssssssssesas==) - Os - loss: 0.0656 - acc: 0.9895 - val_loss: 0.0655 - val_acc: 0.9890 
Epoch 3/25 

6295/6295 (=== SESSSssssssssssssssssss==)] - 0s - loss: 0.0622 - ace: 0.9905 - val_loss: 0.0630 - val_ace: 0.9907 
Epoch 4/25 

3712/6298 [S=SS=SSSSSS=====>. ee eee ee ] - ETA: Os - loss: 0.0637 - ace: 0.9903 


Figure A-6. The training function with verbosity I 


324 


APPENDIXA INTRO TO KERAS 


Verbosity 2 (Figure A-7) 


In 


[i7]: TCN.fit(x train, y train, 
batch _size=128, 
epochs=25, 
verbose=2, 
validation data=(x test, y test), 
callbacks = [checkpointer]) 


Train on 6295 samples, validate on 4197 samples 

Epoch 1/25 

is - loss: 0.0633 - acc: 0.9900 - val_loss: 0.0639 - val_acc: 0.9914 
Epoch 2/25 

Os - loss: 0.0621 - acc: 0.9905 - val_loss: 0.0626 - val_acc: 0.9914 
Epoch 3/25 

Os - loss: 0.0639 - acc: 0.9897 - val_loss: 0.0637 - val_acc: 0.9912 
Epoch 4/25 


Figure A-7. The training function with verbosity 2 


callbacks: A list of keras.callbacks.Callback instances. Remember 
the ModelCheckpoints instance defined earlier as “checkpointer”? 
This is where you include it. To see how it’s done, refer to one of the 
above figures that showcase the model. fit() function being called. 


validation_split: A float value between 0 and 1 that tells the model 
how much of the training data should be used as validation data. 


validation_data: A tuple (x_val, y_val) or (x_val, y_val, val_sample_ 
weights) with variable parameters that pass the validation data to 
the model, and optionally, the val_sample_weights as well. This also 
overrides validation_split, so use one or the other. 


shuffle: A Boolean that tells the model whether or not to shuffle 
the training data before each epoch, or pass in a string for “batch’, 
meaning it shuffles in batch-sized chunks. 


class_weight: (optional) A dictionary that tells the model how to 
weigh certain classes in the training process. You can use it to weigh 
under-represented classes higher, for example. 


sample_weight: (optional) A Numpy array of weights that have a 1:1 
map between the training samples and the weight array you passed 
in. If you have temporal data (an extra time dimension), pass in a 2D 


325 


APPENDIX A 


Model 


INTRO TO KERAS 


array with a shape (samples, sequence_length) to apply these weights 
to each timestep of the samples. Don’t forget to set “temporal” for 
sample_weight_mode in model. compile(). 


initial_epoch: An integer that tells the model what epoch to start 


training at (can be used when resuming training). 


steps_per_epoch: The number of steps, or batches of samples, for 


the model to take before completing one epoch. 


validation_steps: (Only if you specify steps_per_epoch.) The number 
of steps to take (number of batches of samples) to use for validation 


before stopping. 


validation_freq: (Only if you pass in validation data.) If you pass in 
n, it runs validation every n epochs. If you pass in [a, e, h], it runs 


validation after epoch a, epoch e, and epoch h. 


Evaluation and Prediction 


After training the model, you can not only evaluate its performance on some test data, 


but you can make predictions and use the output for any other application you want. 


Previously, you've used the predictions to generate AUC scores to help better evaluate 


the model (accuracy is not the best metric to judge model performance by), but you can 


use these 


predictions in any way you want, especially if the model’s really good at its job. 


The code to evaluate your model on some test data might look similar to Figure A-8. 


im f 34 model.evaluate(x, y, verbose=0) 


Figure A-8. Code to evaluate the model given x and y data sets 


For model.evaluate(), the parameters are 


326 


x: The Numpy array representing the test data. Pass in a list of Numpy 


arrays if the model has multiple inputs. 


y: The Numpy array of target or label data that is a part of the test 
data. If there are multiple inputs, pass in a list of Numpy arrays. 


APPENDIXA INTRO TO KERAS 


e batch_size: If none is specified, the default is 32. This parameter 
expects an integer value that dictates how many samples there are 
per evaluation step. 


e verbose: If set to 0, no output is shown. If set to 1, the progress bar is 
shown and looks like Figure A-9. 


In [19]: 
score = TCN.evaluate(x_test, y test, verbose=1) 
print ("Test loss:', score[0]) 
print ('Test accuracy:', score[i)) 


4197/4197 [ SHSSSSSSSaaaaaaa aaa aaaaaaaaaa===) - ils 
Test loss: 0.06095811567112381 
Test accuracy: 0.9914224446032881 


Figure A-9. The evaluate function with verbosity I 


e sample_weight: (optional) A Numpy array of weights for each of 
the test samples. Again, either a 1:1 map between the sample and 
the weights, unless it’s temporal data. If you have temporal data (an 
extra time dimension), pass in a 2D array with a shape (samples, 
sequence_length) to apply these weights to each timestep of the 
samples. Don’t forget to set “temporal” for sample_weight_mode in 
model.compile(). 


e steps: If None, then ignored. Otherwise, it’s the integer parameter n 
number of steps (batches of samples) before declaring the evaluation 
as done. 


e callbacks: Works the same way as the callbacks parameter for model. 


fit(). 


Finally, to make predictions, you can run code similar to Figure A-10. 


In [ js: | model.predict (x) 


Figure A-10. The prediction function generates predictions given some data set x 


O20 


APPENDIXA INTRO TO KERAS 


In this case, the parameters are 


e x: The Numpy array representing the prediction data. Pass in a list of 
Numpy arrays if the model has multiple inputs. 


e batch_size: If none is specified, the default is 32. This parameter 
expects an integer value that dictates how many samples there are 
per batch. 


e verbose: Either a0 or 1. 


e steps: How many steps to take (batches of samples) before finishing 
the prediction process. This is ignored if None is passed in. 


e callbacks: Works the same way as the callbacks parameter for model. 
fit(). 
One more thing to mention: If you’ve saved a model, you can load it again by calling 


the code in Figure A-11. 


in if is from keras.models import load model 


model = load model('filepath.h5') 


Figure A-11. Loading a model given some file path 


Now that we've covered the basics of model construction and operation, let’s move 
on to the parts that constitute the models themselves: layers. 


Layers 
Input Layer 


keras. layers. Input() 


This is the input layer of the entire model, and it has several parameters: 


e shape: This is the shape tuple of integers that tells the layer what 
shape to expect. For example, if you pass in shape=(input_shape) and 
input_shape is (31, 1), you’re telling the model to expect entries that 
each have a dimension (31, 1). 


328 


APPENDIXA INTRO TO KERAS 


batch_shape: This is also a shape tuple of integers that includes the 
batch size. Passing in batch_shape = (input_shape), where input_ 
shape is (100, 31, 1), tells the model to expect batches of 100 31x1 
dimensional entries. Passing in an input_shape of (None, 31, 1) tells 
the model that the number of batches can be some arbitrary number. 


name: (Optional) A string name for the layer. It must be unique, and 
if nothing is passed in, some name is autogenerated. 


dtype: The data type that the layer should expect the input data to 
have, specified as a string. It can be something like ‘int32, “float32; etc. 


sparse: A Boolean that tells the layer whether or not the placeholder 


that the layer creates is sparse. 


tensor: (Optional) A tensor to pass into the layer to serve as the 
placeholder for input. If something is passed in, then Keras will not 
automatically create some placeholder tensor. 


Dense Layer 


keras. layers .Dense() 


This is a neural network layer comprised of densely-connected neurons. Basically, 


every node in this layer is fully connected with the previous and next layers if there are any. 


Here are the parameters: 


units: The number of neurons in this layer. This also factors into the 


dimension of the output space. 
activation: The activation function to use for this layer. 


use_bias: A Boolean for whether or not to use a bias vector in this 


layer. 


kernel_initializer: An initializer for the weight matrix. For more 


information, check out the Initializers section. 


bias_initializer: Similar to the kernel_initializer, but for the bias. 


329 


APPENDIXA INTRO TO KERAS 


e kernel_regularizer: A regularizer function that’s been applied to the 
weight matrix. For more information, check out the Regularizers 


section. 
e bias_regularizer: Regularizer function applied to the bias. 


e activity_regularizer: Regularizer function applied to the output of 
the layer. 


e kernel_constraint: A constraint function applied to the weights. For 


more information, check out the Constraints section. 
e bias_constraint: A constraint function applied to the bias. 


For a better idea of what a dense layer is, check out Figure A-12. 


Input 
Data Input Layer Dense Layer 1 Dense Layer 2 


OR FOR FOX cumin 
Vind — Wal - Nod 
EMA EVA YAO 
Oe OO 
OB OBMSOFAC 
Sao ae, 


Figure A-12. Dense layers in an artificial neural network 


330 


APPENDIXA INTRO TO KERAS 


Activation 
keras. layers .Activation() 


This layer applies an activation function to the input. Here is the argument: 


e activation: Pass in either the activation function (see the Activations 


section) or some Theanos or TensorFlow operation. 


To understand what an activation function is, Figure A-13 shows what each artificial 
neuron looks like. 






Xy * 
4, 
; 7 New Output 
X2 W 


Activation 
Function 






X a 


Figure A-13. The activation function is applied to the output of the function the 
node carries out on the input 


The activation passes in the output from the input « weights + bias and passes it into 
the activation function. If there is no activation function, then that input just gets passed 
along as the output. 


Dropout 
keras. layers .Dropout() 


What the dropout layer does is take some float f proportion of nodes in the preceding 
layer and “deactivates” them, meaning they don’t connect to the next layer. This can help 
combat overfitting on the training data. 


331 


APPENDIXA INTRO TO KERAS 


Here are the parameters: 


e rate: A float value between 0 and 1 that indicates the proportion of 


input units to drop. 


e noise_shape: A 1D integer binary tensor that is multiplied with 
the input to determine what units are turned on or off. Instead of 
randomly selecting values using rate, you can pass in your own 
dropout mask to use in the dropout layer. 


e seed: An integer to use as a random seed. 


Flatten 


keras.layers.Flatten() 


This layer takes all of the inputs and flattens them into a single dimension. 

Images can have three channels if they’re color images. They can be RGB (red, green, 
blue), BGR (blue, green, red), HSV (hue, saturation, value), etc., so the dimensions of 
these images are actually (height, width, channels) if it’s formatted channels last or 
(channels, height, width) if it’s formatted channels first. To preserve this formatting, 


there is a parameter you can pass in to the flatten layer: 


e data_format: A string that’s either ‘channels_first’ or ‘channels_last‘ 
This tells the flattening layer how to format the flattened output to 
preserve this formatting. 


To get a better idea of how the layer flattens the input, check out the summary in 
Figure A-14 of a convolutional neural network. 


332 


APPENDIXA INTRO TO KERAS 


In [5]: 
import keras 


from keras.models import Model 

from keras.layers import Input, Dense, Convolution2D, Flatten 

input_layer = Input (shape=(32, 32, 3)) 

conv_1 = Convolution2D(128, kernel _size=2, padding="same", activation='relu') (input layer) 
19 conv_2 = Convolution2D(128, kernel _size=2, padding="same', activation='relu') (conv_1) 

flat_1 = Flatten() (conv_2) 

classes = Dense(i0, activation="softmax') (flat _1) 


16 conv_net = Model(input_layer, classes) 


conv_net. summary () 


Layer (type) 6 | Output Shape oo .§ Param$ 
input_S (Inputtayer) (None, 32, 32, 3)” 
conv2d_6 (Conv2D) ~—~—~—«* (None, 32, 32, 128) 1664 ~~ 
flatten 2 (Platten) (None, 131072) 0 











Total params: 1,378,058 
Trainable params: 1,378,058 
Non-trainable params: 0 


Figure A-14. Notice how the flattening layer reduces the dimensionality of its input 


Spatial Dropout 1D 
keras. layers .SpatialDropout1D() 


This function drops entire 1D feature maps instead of neuron elements, but 
otherwise has the same functionality as the regular dropout function. In earlier 
convolutional layers, the feature maps tend to be strongly correlated, so regular dropout 
functions won't help much with regularization in that case. Spatial dropout helps address 
this and also helps improve independence between the feature maps themselves. 

The function takes one parameter: 


e rate: A float between 0 and 1 that determines the proportion of input 
units to drop. 


333 


APPENDIXA INTRO TO KERAS 


Spatial Dropout 2D 
keras. layers .SpatialDropout2D() 


This function is similar to the spatial dropout 1D function, except it works on 2D 
feature maps. Images can have three channels if they're color images. They can be 
RGB (red, green, blue), BGR (blue, green, red), HSV (hue, saturation, value), etc., so 
the dimensions of these images are actually (height, width, channels) if it’s formatted 
channels last or (channels, height, width) if it’s formatted channels first. 

This function takes one additional parameter compared to SpatialDropout1D(): 


e rate: A float between 0 and 1 that determines the proportion of input 
units to drop. 


e data_format: ‘channels_first’ or ‘channels_last: This tells the 
flattening layer how to format the flattened output to preserve the 
formatting of channels first or channels last. 


Conv1D 


keras. layers.Conv1D() 


Check out Chapter 7 for a detailed explanation on how one-dimensional 
convolutions work. 

This layer is a one-dimensional (or temporal) convolutional layer. It basically passes 
a filter over the one-dimensional input and multiplies the values element-wise to create 
the output feature map. 

These are the parameters that the function takes: 


e filters: An integer value that determines the dimensionality of the 
output space. In other words, this is also the number of filters in the 


convolution. 


e kernel_size: An integer (or tuple/list of a single integer) that specifies 
the length of the filter/kernel that is used in the 1D convolution. 


e strides: An integer (or tuple/list of a single integer) that tells the 
layer how many data entries to shift by after one element-wise 
multiplication of the filter and the input data. Note: A stride value != 1 
isn’t compatible if the dilation_rate != 1. 


334 


APPENDIXA INTRO TO KERAS 


padding: ‘valid; ‘causal, or ‘same’ ‘valid’ doesn’t zero pad the output. 
‘same’ zero pads the output so that it’s the same length as the input. 
‘causal’ padding generates causal, dilated convolutions. For an 
explanation on what ‘causal’ padding is, refer to Chapter 7. 


data_format: ‘channels_first’ or ‘channels_last: This tells the 
flattening layer how to format the flattened output to preserve the 
formatting of channels first or channels last. ‘channels_first’ has 
the format (batch, features, steps), and ‘channels_last’ has the 
format (batch, steps, features). 


dilation_rate: An integer (or tuple/list of a single integer) serves 
as the dilation rate for this dilated convolutional layer. For an 
explanation of how this works, refer to Chapter 7. 


activation: Passes in either the activation function (see the 
Activations section) or some Theanos or TensorFlow operation. 
If nothing is specified, the data is passed along unaltered after the 


convolutional process. 


use_bias: A Boolean for whether or not to use a bias vector in this 


layer. 


kernel_initializer: An initializer for the weight matrix. For more 


information, check out the Initializers section. 
bias_initializer: Similar to the kernel_initializer, but for the bias. 


kernel_regularizer: A regularizer function that’s been applied to the 
weight matrix. For more information, check out the Regularizers 


section. 
bias_regularizer: A regularizer function applied to the bias. 


activity_regularizer: A regularizer function applied to the output of 


the layer. 


kernel_constraint: A constraint function applied to the weights. For 
more information, check out the Constraints section. 


bias_constraint: A constraint function applied to the bias. 


335 


APPENDIXA INTRO TO KERAS 


Conv2D 


keras. layers.Conv1D() 


Check out Chapter 3 for a detailed explanation on how the 2D convolutional layer 
works. 

This layer is a two-dimensional convolutional layer. It basically passes a 2D filter over 
the input and multiplies the values element-wise to create the output feature map. 

These are the parameters that the function takes: 


e filters: An integer value that determines the dimensionality of the 
output space. In other words, this is also the number of filters in the 


convolution. 


e kernel_size: An integer (or tuple/list of two integers) that specifies 
the height and width of the filter/kernel that is used in the 2D 


convolution. 


e strides: An integer (or tuple/list of two integers, one for height and 
one for width, respectively) that tells the layer how many data entries 
to shift by after one element-wise multiplication of the filter and the 
input data. Note: A stride value != 1 isn’t compatible if the dilation_ 
rate != 1. 


e padding: ‘valid’ or ‘same: ‘valid’ doesn’t zero pad the output. ‘same’ 
zero pads the output so that it’s the same length as the input. 


e data_format: ‘channels_first’ or ‘channels_last: This tells the 
flattening layer how to format the flattened output to preserve the 
formatting of channels first or channels last. 


e dilation_rate: An integer (or tuple/list of a two integers) serves as the 
dilation rate for this dilated convolutional layer. For an explanation of 
how this works, refer to Chapter 7. 


e activation: Passes in either the activation function (see the 
Activations section) or some Theanos or TensorFlow operation. 
If nothing is specified, the data is passed along unaltered after the 


convolutional process. 


e use_bias: A Boolean for whether or not to use a bias vector in this layer. 


336 


APPENDIXA INTRO TO KERAS 


e kernel_initializer: An initializer for the weight matrix. For more 


information, check out the Initializers section. 
e bias_initializer: Similar to the kernel_initializer, but for the bias. 


e kernel_regularizer: A regularizer function that’s been applied to the 
weight matrix. For more information, check out the Regularizers 


section. 
e bias_regularizer: A regularizer function applied to the bias. 


e activity_regularizer: A regularizer function applied to the output of 
the layer. 


e kernel_constraint: A constraint function applied to the weights. For 


more information, check out the Constraints section. 


e bias_constraint: A constraint function applied to the bias. 


UpSampling 1D 
keras.layers.UpSampling1D() 


For a detailed explanation on how upsampling works, refer to Chapter 7. 
This layer essentially repeats the data n times with respect to time (where n is the 
parameter passed in): 


e size: An integer n that specifies how many times to repeat each data 
entry with respect to time. The order of time is preserved, so each 


element is repeated n times according to its time entry. 


UpSampling 2D 
keras. layers .UpSampling2D() 


Similar to UpSampling1D(), but for 2D inputs. The rows and columns are repeated n 
times according to size[0] and size[1]. 
This is the list of parameters: 


e size: An integer or tuple of two integers. The integer is the 
upsampling factor for both rows and columns, and the tuple lets you 
specify the upsampling factor for rows and for columns individually. 


337 


APPENDIXA INTRO TO KERAS 


e data_format: ‘channels_first’ or ‘channels_last: This tells the 
flattening layer how to format the flattened output to preserve the 
formatting of channels first or channels last. 


e interpolation: ‘nearest’ or ‘bilinear: CNTK does not support 
‘bilinear’ yet, and Theanos only supports size=(2,2). ‘nearest’ and 


‘bilinear’ are interpolation techniques used in image processing. 


ZeroPadding1D 


keras. layers.ZeroPadding1D( ) 


Depending on the input, pads the input sequence with zeroes on both sides or either 
a zero on the left side or a zero on the right side of the input sequence. 
This is the list of parameters: 


e padding: An integer, a tuple of two integers, or a dictionary. The 
integer is a number that tells the layer how many zeroes to add on 
both the left and right side. An input of 1 adds a zero on both the left 
and right side. The tuple is formatted as (left_pad, right_pad), so 
you can pass in (0, 1) to tell it to add no zeroes on the left side and 
add one zero on the right side. 


ZeroPadding2D 


keras. layers.ZeroPadding2D() 


Depending on the input, it pads the input sequence with a row and columns of 
zeroes at the top, left, right, and bottom of the image tensor. 
This is the list of parameters: 


e padding: An integer, a tuple of two integers, a tuple of two tuples 
with two integers each. The integer tells it to add n rows of zeroes on 
the top and bottom of the image tensor, and n columns of zeroes. 
The tuple of two integers is formatted as (symmetric height pad, 
symmetric width pad), so you can tell the layer to add m rows of 
zeroes and n columns of zeroes to each side, respectively, if you pass 
in a tuple (m, n). Finally, the tuple of two tuples is formatted as ((top__ 
pad, bottom pad), (left _pad, right _pad)), so you can customize 
even more how you want the layer to add rows or columns of zeroes. 

338 


APPENDIXA INTRO TO KERAS 


e data_format: ‘channels_first’ or ‘channels_last: This tells the 
flattening layer how to format the flattened output to preserve the 
formatting of channels first or channels last. 


MaxPooling1D 
keras. layers .MaxPooling1D() 


It applies max pooling on a 1D input. To get a better idea of how max pooling works, 
check out Chapter 3. Max pooling in 1D is similar to max pooling in 2D, except the 
sliding window only works in one dimension, going from left to right. 


This is the list of parameters: 


e pool _size: An integer value. If an integer n is given, then the 
window size of the pooling layer is 1xn. These are also the factors to 
downscale by, so if an integer n is passed in, the dimensions for both 
height and width are downscaled by that factor. 


e strides: An integer or None. By default, the stride is set to pool_size. 
If you pass in an integer, the pooling window moves by integer n 
amount after completing its pooling operation on a set of entries. 


e padding: ‘valid’ or ‘same: ‘valid’ means there’s no zero padding, and 
‘same’ pads the output sequence with zeroes so that it matches the 
dimensions of the input sequence. 


e data_format: ‘channels_first’ or ‘channels_last: This tells the 
flattening layer how to format the flattened output to preserve the 
formatting of channels first or channels last. ‘channels_first’ has 
the format (batch, features, steps), and ‘channels_last’ has the 
format (batch, steps, features). 


MaxPooling2D 
keras. layers .MaxPooling2D() 


It applies max pooling on a 2D input. To get a better idea of how max pooling works, 
check out Chapter 3. 


339 


APPENDIX A 


INTRO TO KERAS 


This is the list of parameters: 


pool_size: An integer that dictates the size of the pooling window. 
An integer of n makes the pooling window size n, meaning it sifts 
through n entries at a time and selects the maximum value to pass on 


to the output. 


strides: An integer or None. By default, the stride is set to pool_size. 
If you pass in an integer, the pooling window moves by integer n 
amount after completing its pooling operation on a set of entries. It is 
also a factor that determines how much to downscale the dimensions 
by, as a parameter n will reduce the dimensions by a factor n. 


padding: ‘valid’ or ‘same’ ‘valid’ means there’s no zero padding, and 
‘same’ pads the output sequence with zeroes so that it matches the 


dimensions of the input sequence. 


data_format: ‘channels_first’ or ‘channels_last: This tells the 
flattening layer how to format the flattened output to preserve the 
formatting of channels first or channels last. 


Loss Functions 


In the examples, y_true is the true label and y_pred is the predicted label. 


Mean Squared Error 


keras.losses.mean squared error(y true, y pred) 


If you 


have questions on the notation for this equation, refer to Chapter 3. See the 


equation in Figure A-15. 


n 


(0) = = (ho(x") -y')? 


i=1 


Figure A-15. The equation for mean squared error 


340 


APPENDIXA INTRO TO KERAS 


Given input 0, the weights, the formula finds the average difference squared between 
the predicted value and the actual value. The parameter h, represents the model with 
the weight parameter 0 passed in, so h,(x') gives the predicted value for x’ with model’s 
weights 0. The parameter y’ represents the actual prediction for the data point at index i. 
Lastly, there are n entries in total. 

This loss metric can be used in autoencoders to help evaluate the difference between 
the reconstructed output and the original. In the case of anomaly detection, this metric 
can be used to separate the anomalies from the normal data points, since anomalies 


have a higher reconstruction error. 


Categorical Cross Entropy 
keras.losses.categorical crossentropy(y true, y pred) 


See the equation in Figure A-16. 


J(0) = — = y, « log(ho(x)) + (1 — y,) # log( — g(x) 
- i=0 


Figure A-16. The equation for categorical cross entropy 


In this case, n is the number of samples in the whole data set. The parameter hy, 
represents the model with the weight parameter 0 passed in, so h,(x;,) gives the predicted 
value for x; with model’s weights 0. Finally, y; represents the true label for data point 
at index i. The data needs to be regularized to be between 0 and 1, so for categorical 
cross entropy, it must be piped through a softmax activation layer. The categorical cross 
entropy loss is also called softmax loss. 

Equivalently, you can write the previous equation as shown in Figure A-17. 


1 mem 
J) === > yy log(he(x,) 


i=0 j=0 
Figure A-17. Another way to write the equation for categorical cross entropy 


In this case, m is the number of classes. 

The categorical cross entropy loss is a commonly used metric in classification tasks, 
especially in computer vision with convolutional neural networks. Binary cross entropy 
is a special case of categorical cross entropy where the number of classes m is two. 


34] 


APPENDIXA INTRO TO KERAS 


Sparse Categorical Cross Entropy 
keras.losses.sparse categorical crossentropy(y true, y pred) 


Sparse categorical cross entropy is basically the same as categorical cross entropy, 
but the distinction between them is in how their true labels are formatted. For 
categorical cross entropy, the labels are one-hot encoded. For an example of this, refer 
to Figure A-18, if you had your y_train formatted originally as the following, with six 
maximum classes. 


Data at index 0 


Data at index 1 


hh Ue 


Data at index 2 
Data at index 3 2 


Figure A-18. An example of how y_train can be formatted. The value in each 
index is the class value that corresponds to the value at that index in x_train 


You can call keras.utils.to categorical(y train, n classes) with n_classes as 
6 to convert y_train to that shown in Figure A-19. 


DataatindexO | Q1QO0Q0 
Dataatindex1 | QOQQOQOO1 
Data atindex2 | OQOQO010 
Data atindex3 | 001000 


Figure A-19. The y_ train in Figure A-18 is converted into a one-hot encoded 
format 


So now your y_train looks like Figure A-20. 


342 


APPENDIXA INTRO TO KERAS 


In [7]: y train = [2,. 3; %,. 23 


keras.utils.to categorical(y train, 6) 


Gutti7j: arrzay([((0.,. 2., 0.6; Giz Ge, O.], 
iGo: Gig Bay Bay Gay: La], 
[O., 0., 0., 0., 1., 0.], 
[O., 0., 1., 0., 0., 0.)]) 


Figure A-20. Converting y_train into a one-hot encoded format in Jupyter 


This type of truth label formatting (one-hot encoding) is what categorical cross 
entropy uses. For sparse categorical cross entropy, it suffices to simply pass in the 


information in Figure A-21. 


Data at index 0 


Data at index 1 


Rue 


Data at index 2 


Data at index 3 2 


Figure A-21. The y_ train to pass in for sparse categorical cross entropy 


Or the code shown in Figure A-22. 
In [7]: lL | y_train = (1, 5, 4, 2] 


Figure A-22. An example of y_train in the code that can be passed in if sparse 
categorical cross entropy is the metric 


Metrics 
Binary Accuracy 


keras.metrics.binary accuracy(y true, y pred) 


To use this function, the ‘accuracy’ must be a metric that’s passed into the model. 


compile() function, and binary cross entropy must be the loss function. 


343 


APPENDIXA INTRO TO KERAS 


Essentially, the function finds the number of instances where the true class label 
matches the rounded prediction label and finds the mean of the result (which is the 
same thing as dividing the total number of correct matches by the total number of 
samples). 

The predicted values are rounded since as the neural network is trained more and 
more, the output values tend to change so that the predicted value is something really 
close to one, and the rest of the value are something really close to zero. In order to 
match the predicted values to the original truth labels (which are all integers), you can 
simply round the predicted values. 

In the official Keras documentation on GitHub, this function is defined as shown in 
Figure A-23. 


def binary_accuracy(y_true, y_pred): 
"""Calculates the mean accuracy rate across all predictions for binary 
classification problems. 


return K.mean(K.equal(y_true, K.round(y_pred))) 


Figure A-23. The code definition in the Keras GitHub page of binary accuracy 


Categorical Accuracy 
keras.metrics.categorical accuracy(y true, y pred) 


Since most problems tend to involve categorical cross entropy (implying more than 
two classes in the data set), this tends to be the default accuracy metric when ‘accuracy’ 
is passed into the model. compile() function. 

Instead of finding all of the instances where the true labels and rounded predictions 
match, categorical accuracy finds all of the instances where the true labels and 
predictions have a maximum value in the same spot. 

Recall that for categorical cross entropy, the labels are one-hot encoded. Therefore, 
the truth labels only have one maximum per entry, along with the predictions (though 
again, one value will be really close to one while the others are really close to zero). What 
categorical accuracy does is check if the maximum value in the entry is in the same 
position for both y_true and for y_pred. 

Once it’s found all those instances, it finds the mean of the result, leading to an 
accuracy value. 

Essentially, it’s a similar equation to the one for binary accuracy, but with a different 
condition regarding y_true and y_pred. 


344 


APPENDIXA INTRO TO KERAS 
The function is defined by Keras as shown in Figure A-24. 


def categorical_accuracy(y_true, y_pred): 
""*Calculates the mean accuracy rate across all predictions for 
multiclass classification problems. 


return K.mean(K.equal(K.argmax(y_true, axis=-1), 
K.argmax(y_pred, axis=-1))) 


Figure A-24, The code definition of categorical accuracy as seen in the Keras 
GitHub page 


Of course, there are many more metrics that are available on the Keras 
documentation, and you can even define custom metrics. To do that, just simply define 
a function that takes in y_true and y_pred, and call that function name in your metrics, 
as shown in Figure A-25. 


In [ ]: import keras.backend as K 
def custom_metric(y true, y pred): 
matches = K.equal(y true, K.round(y pred) ) 
score = K.mean(matches) 
return score 
model.compile (optimizer="optimizer', 
loss="loss function’, 
metrics=['accuracy", custom _metric]) 


Figure A-25. Code to define a custom metric and use that for the model 


In this example, you simply rewrite the binary accuracy metric in several lines and 
return the score. You can actually condense this function to just one line like in the actual 


implementation seen above, but this is just an example to showcase a custom metric. 


Optimizers 
SGD 
keras.optimizers.SGD() 


This is the stochastic gradient descent optimizer, a type of algorithm that aids in 
the backpropagation process by adjust the weights. It is commonly used as a training 


algorithm in a variety of machine learning applications, including neural networks. 


345 


APPENDIXA INTRO TO KERAS 


The optimizer has several parameters: 


e Ir: Some float value where the learning rate Ir >= 0. The learning rate 
is a hyperparameter that determines how big of a step to take when 


optimizing the loss function. 


e momentum: Some float value where the momentum m >= 0. 
This parameter helps accelerate the optimization steps in the 
direction of the optimization, and helps reduce oscillations when 
the local minimum is overshot (refer to Chapter 3 to refresh your 


understanding on how a loss function is optimized). 


e decay: Some float value where the decay d >= 0. Helps determine 
how much the learning rate decays by after each update (so that as 
the local minimum is approached, or after some number of training 
iterations, the learning rate decreases so smaller step sizes are taken. 
Big learning rates means the local minimum might be overshot more 


easily). 


e nesterov: A Boolean value to determine whether or not to apply 
Nesterov momentum. Nesterov momentum is a variation of 
momentum where the gradient is computed not from the current 
position, but from a position that takes into account the momentum. 
This is because the gradient always points in the right direction, 
but the momentum might carry the position too far forward and 
overshoot. Since it doesn’t use the current position but instead 
some intermediate position that takes into account momentum, the 
gradient from that position can help correct the current course so 
that the momentum doesn’t carry the new weights too far forward. 


It essentially helps for more accurate weight updates and helps converge faster. 


Adam 


keras.optimizers.Adam() 


The Adam optimizer is an algorithm that extends upon SGD, and has grown quite 
popular in deep learning applications in computer vision and in natural language 


processing. 


346 


APPENDIXA INTRO TO KERAS 


These are the parameters for the algorithm: 


e Ir: Some float value where the learning rate lr >= 0. The learning rate 
is a hyperparameter that determines how big of a step to take when 
optimizing the loss function. The paper describes good results with a 
value of 0.001 (the paper refers to the learning rate as alpha). 


e beta_1: Some float value where 0 < beta_I < 1. This is usually some 
value close to 1, but the paper describes good results with a value of 0.9. 


e beta_2: Some float value where 0 < beta_2 < 1. This is usually some 
value close to 1, but the paper describes good results with a value 
of 0.999. 


e epsilon: Some float value where epsilon e >= 0. If None, then it 
defaults to K.epsilon(). Epsilon is some small number, described as 
10E-8 in the paper, to help prevent division by 0. 


e decay: Some float value where the decay d >= 0. Helps determine 
how much the learning rate decays by after each update (so that as 
the local minimum is approached, or after some number of training 
iterations, the learning rate decreases so smaller step sizes are taken. 
Big learning rates means the local minimum might be overshot more 


easily). 


e amsgrad: A Boolean on whether or not to apply the AMSGrad 
version of this algorithm. For more details on the implementation 
of this algorithm, check out “On the Convergence of Adam and 
Beyond.’ 


RMSprop 


keras.optimizers.RMSprop() 


RMSprop is a good algorithm for recurrent neural networks. RMSprop is a gradient- 
based optimization technique developed to help address the problem of gradients 
becoming too large or too small. RMSprop helps combat this problem by normalizing 
the gradient itself using the average of the squared gradients. In Chapter 7, it’s explained 
that one of the problems with RNNs is the vanishing/exploding gradient problem, 
leading to the development of LSTMs and GRU networks. And so it’s of no surprise that 
RMSprop pairs well with recurrent neural networks. 


347 


APPENDIX A 


INTRO TO KERAS 


Besides the learning rate, it’s recommended to leave the rest of the algorithms in 


their default settings. With that in mind, here are the parameters for this optimizer: 


Ir: Some float value where the learning rate Ir >= 0. The learning rate 
is a hyperparameter that determines how big of a step to take when 


optimizing the loss function. 


rho: Some float value where rho >= 0. Rho is a parameter that helps 
calculate the exponentially weighted average over the gradients squared. 


epsilon: Some float value where epsilon e >= 0. If None, then it 
defaults to K.epsilon(). Epsilon is a very small number that helps 
prevent division by 0 and to help prevent the gradients from blowing 
up in RMSprop. 


decay: Some float value where the decay d >= 0. Helps determine how 
much the learning rate decays by after each update (so that as the local 
minimum is approached, or after some number of training iterations, 

the learning rate decreases so smaller step sizes are taken. Big learning 


rates means the local minimum might be overshot more easily). 


Activations 


You can pass in something like ‘activation function’ for the activation parameter ina 


layer, or the full function, keras.activations.activation function(), if you want to 


customize it more. Otherwise, the default initialized activation function is used in the layer. 


Softmax 


keras.activations.softmax() 


This performs a softmax activation on the input x and on the given axis. 


The two parameters are 


x: The input tensor 


axis: The axis that you want to use softmax normalization on. By 


default, it is set to -1. 


The general formula for softmax is shown in Figure A-26 (K is the number of 


samples). 


348 


APPENDIXA INTRO TO KERAS 


weg 
el 
(x); = = oe, For j = 1,...,K and x = (x,, -.., %) © R* 
jae 


Figure A-26. The general formula for softmax 


ReLU 


keras.activations.relu() 


ReLU, or “Rectified Linear Unit’, performs a simple activation based on the function 
shown in Figure A-27. 


f(x) = max(0, x) 
Figure A-27. This is the general ReLU formula 
The parameters are as follows: 


e x: The input tensor 


e alpha: A float that determines the slope of the negative part. Set to 
zero by default. 


e max value: A float value that represents the upper threshold, and is 
set to None by default. 


e threshold: A float value set to 0.0 by default that’s the lower 
threshold. 


If max_value is set, then you get the equation shown in Figure A-28. 


f(x) = max_value for x > max_value 


Figure A-28. The ReLU formula if max_value is set 


If threshold is also set, then you get the equation shown in Figure A-29. 


f(x) =x  forthreshold < x < max_value 


Figure A-29. The ReLU formula if threshold is also set 


349 


APPENDIXA INTRO TO KERAS 
Otherwise you get the equation shown in Figure A-30. 
f(x) = alpha «(x — threshold) 
Figure A-30. The formula for ReLU if alpha and threshold are set 


For an example of what the base ReLU function does, refer to Figure A-31. 






y =xforallx>0O 


y=Oforallx<0O 


Figure A-31. The graph for a basic ReLU function 
Sigmoid 
keras.activations.sigmoid(x) 

This is a simple activation function to call, as there are no parameters other than the 
input tensor x. 

The sigmoid function does have its uses, primarily because it forces the input to be 
between 0 and 1, but it is prone to the vanishing gradient problem, and so it is seldom 


used in hidden layers. 


To get an idea of what the equation is like when graphed, refer to Figure A-32. 


350 


APPENDIXA INTRO TO KERAS 








=10.0 =F =5.0 =15 oD £3 =O iS 10.0 


Figure A-32. The graph of a sigmoid function 


Callbacks 
ModelCheckpoint 


keras.callbacks.ModelCheckpoint() 


ModelCheckpoint is basically a function that saves the model every epoch (unless 
otherwise directed via parameters). How it does so can be configured by the set of 
parameters associated with ModelCheckpoint (): 


e filepath: The path where you want to save the model file. Typing just 
“model_name.h5” saves it in the same directory. 


e monitor: The quantity that you want the model to monitor. By 
default, it’s set to “val_loss” 


e verbose: Sets verbosity to 0 or 1. It’s set to 0 by default. 


e save_best_only: If set to true, then the model with the best 
performance according to the quantity monitored will be saved. 


e save_weights_only: If set to True, then only the weights will be saved. 
Essentially, if True, model.save weights(filepath); else, model. 
save(filepath). 


351 


APPENDIX A 


INTRO TO KERAS 


mode: Choose between auto, min, or max. If save_best_only is True, 
then you should pick a choice that would suit the monitored quantity 
best. If you chose val_acc for monitor, then you want to pick max for 


mode, and if you choose val_loss for monitor, pick min for mode. 


period: How many epochs there are between each checkpoint. 


TensorBoard 


keras.callbacks.TensorBoard() 


TensorBoard is a visualization tool that comes with TensorFlow. It helps you see in 


detail what’s going on as your model trains. 


To launch TensorBoard, type this into the command prompt: 


tensorboard --logdir=/full path to your logs 


keras.callbacks.TensorBoard(log dir='./logs', histogram freq=0, batch_ 


S1ze=32, 


write graph=True, write grads=False, write _images=False, 


embeddings freq=0, embeddings layer names=None, embeddings metadata=None, 


embeddings data=None, update freq='epoch' ) 


With that, here is the list of parameters: 


352 


log dir: The path to the directory where you want the model to save 
the log files. This is the same directory you pass as an argument in the 
command prompt. It is ‘/logs’ by default. 


histogram_freq: The frequency (in epochs) that you want the 
activation and weight histograms to be computed for the model’s 
layers. Set to 0 by default, which means it won’t compute histograms. 
To visualize these histograms, validation_data (or validation_split) 
must be passed in. 


batch_size: The size of each batch of inputs to pass into the network 
to compute histograms from. Set to 32 by default. 


write_graph: Whether or not to allow the graph to be visualized in 
TensorBoard. Set to True by default. Note: When set to True, the log 
files can become large. 


APPENDIXA INTRO TO KERAS 


e write_grads: Whether or not to allow TensorBoard to visualize 
the gradient histograms. Set to False by default, and also needs 
histogram_freq to be a value greater than 0. 


e write_images: Whether or not to visualize the model weights as an 
image in TensorBoard. Set to False by default. 


e embeddings_freq: The frequency, in epochs, to save selected 
embedding layers. Set to 0 by default, which means that the 
embeddings won't be computed. To visualize data in TensorBoard’s 
Embedding tab, pass in the data as embeddings_data. 


e embeddings_layer_names: The list of names of layers for 
TensorBoard to track. If None or an empty list, then all of the layers 
will be watched. Set to None by default. 


e embeddings_metadata: A dictionary that maps layer names to the 
corresponding file names where the metadata for this embedding 
layer is saved. Set to None by default. If the same metadata file is used 
for all of the embedding layers, then a string can be passed. 


e embeddings_data: The data to be embedded at the layers specified 
in embeddings_layer_names. This is a Numpy array if the model 
expects a single input, and multiple Numpy arrays if the model has 
multiple inputs. Set to None by default. 


e update_freq: A ‘batch, ‘epoch, or integer. ‘batch’ writes the losses and 
metrics to TensorBoard after each batch. ‘epoch’ is the same, except 
the losses and metrics are written to TensorBoard after each epoch. 
The integer tells it to write the metrics and losses to TensorBoard 
every integer n samples, where n is the integer passed in. Note: 
Writing to TensorBoard too frequently can slow down the training 
process. 


With that being said, Figure A-33 shows an example of using TensorBoard as a 
callback when training a convolutional neural network on the MNIST data set. 


353 


APPENDIXA INTRO TO KERAS 





tensorboard = keras.callbacks.TensorBoard(log dir='./Graph', 
histogram freq=0, 


write graph=True, write images=True) 


checkpoint = 
keras.callbacks.ModelCheckpoint(filepath="keras MNISI CNN.hS”, 


verbose=0, 


save: best. only=True) 


modél.«f10(x Crain, y train, 
batch size=batch size, 
epochs=n epochs, 
verbose=l, 
Validation data=(x test, y test), 


callbacks=[checkpoint, tensorboard] ) 


Figure A-33. Code to define a TensorBoard callback and use that when training 


Once you execute that code, you will notice the training process will begin. At this 


point, enter the line 


tensorboard --logdir=/full path to your logs 


into your command prompt and press Enter. It should show you something like 
Figure A-34. 


354 


APPENDIXA INTRO TO KERAS 


TensorBoard 1.10. at http://MSI:6006 (Press CTRL+C to quit) 
Figure A-34. You should see something like this after executing the above line in 


command prompt. It should tell you where to go to access TensorBoard, which is 
http://MSI:6006 in this case 


Simply follow that link and you should see the screen shown in Figure A-35. 





















































(D Stow date downinad links O., Fittor tags (regular expeonsions supported) 
ignere outliers in chan scaling 
acc 
Tesla poctang me nodt cea 7 
Ce 
Sanpottang oo —t— 
7 —— 
——_—_—_-# of ons Pal 
bard ra 
Horizontal dina ay i 
SS j 
oe well 
t 
Pus ; = EC) 
Wire a nese 
i S loss 
hee 
|} 
ane \ 
13a \ 
meng 
i 
en ee 
————— 
Ls | 
=o 
wal wor 
ww res 


POLE ALS PSE 


Figure A-35. The general page that appears when you launch TensorBoard 


355 


APPENDIXA INTRO TO KERAS 


From here, you can see graphs for the metrics accuracy and loss. You can expand the 
other two metrics, val_acc and val_loss, to view those graphs as well (see Figure A-36). 


val_acc 


val_acc 


val_loss 


val_loss 





2 = 
Ld — 


Figure A-36. Graphs for val_acc and val_loss 


356 


APPENDIXA INTRO TO KERAS 


As for the individual graphs, you can expand them out by pressing the leftmost 
button below the graph, and you can view data on the graph as you move your mouse 
across it, as seen in Figure A-37. 


axe 





Name Smoothed Value Step Time 


0.5635 3200 Mt 


Figure A-37. The result of pressing the leftmost button underneath the graph. 
Doing so expands the graph, and regardless of whether the graph is expanded or 
not, you can point your mouse cursor at any point along the graph to get more 
details about that point 


You can also view a graph of the entire model by pressing the Graphs tab, as shown 
in Figure A-38. 


TensorBoard SCALARS GRAPHS 





Figure A-38. There are two tabs. You started on the tab named SCALARS. Press 
GRAPHS to switch the tab 


357 


APPENDIXA INTRO TO KERAS 


Doing so will result in a graph similar to the one shown in Figure A-39. 





Main Graph ALbUlary Modes 
Dione ti itsts—‘“‘<“‘<“<CSOS ae aie aa 
= joes iad} 
cre deree2 jo, — = 
nan. a 
Upena c ane Adem 
BD Trace inputs 
Coder (Gf) Reaches 
O te dense. T 
oe. | 
= iW atten 
O) metonpe a 
= dropout} f= se 
fap) 0 (Gee) — 


Pave ee” | 


Figure A-39. The result of clicking on the GRAPHS tab 


There are definitely more features and functionality that TensorBoard offers, but the 
general idea is that you will be able to examine your models in a much better fashion. 


Back End (TensorFlow Operations) 


You can also perform operations with TensorFlow (if it is the back end) through Keras by 
importing the back end. Below, we will demonstrate some basic functions, but keep in 
mind that TensorFlow has a vast variety of operations and functions. 

You can use the back end to create custom layers, metrics, loss functions, etc., 
allowing for a much deeper level of customization. However, you must basically be 
knowledgeable in TensorFlow to accomplish all of this, since this is practically just using 
TensorFlow. 

If you want the most customization possible, then using tf.keras along with 
TensorFlow is better, since tf.keras is wholly compatible with all of TensorFlow, and 
you Il have access to many more TensorFlow commands that you can’t get with just the 
Keras back end. 

Here are some of the commands you can execute using the back end (Figure A-40, 
Figure A-41, Figure A-42, Figure A-43). 


358 


APPENDIXA INTRO TO KERAS 


In [125]: | import keras.backend as K 


Declaring a placeholder 
= K.placeholder (shape=(1,2,3)) # Equivalent to tf.placeholder () 
print (a) 


vals = [0, 1, 2, 3, 4, 5] 
8 b = K.variable(value=vals) # Equivaient to tf.Variable() 
9 print (b) 


Tensor ("Placeholder 62:0", shape=(1, 2, 3), dtype=float32) 
<tf£.Variable "Variable 65:0" shape=(6,) dtype=float32_ref> 


Figure A-40. Some TensorFlow operations such as defining placeholders and 
variables done through the Keras back end 


In [126]: 1 import keras.backend as K 


K.placeholder(shape=(1, 2)) 
K.placeholder(shape=(2, 5)) 


7 & 
46d 


print (K.dot(c, d)) 


Tensor ("MatMul_ 13:0", shape=(1, 5), dtype=float32) 


Figure A-41. Finding the dot product of two placeholder variables c and d using 
the Keras back end 


In [134]: | print(K.sum(c, axis=0)) 
print(K.sum(c, axis=1)) 


Tensor ("Sum_7:0", shape=(2,), dtype=float32) 
Tensor ("Sum_8:0", shape=(1,), dtype=float32) 


Figure A-42. Finding the sum of c along different axes using the Keras back end 


In [136]: | print(K.mean(c, axis=0)) 


Tensor("Mean 1:0", shape=(2,), dtype=float32) 
Figure A-43. Finding the mean of c using the Keras back end 


Those are just some of the most basic functions available through the back end. The 
complete list of backend functions is available at https: //keras.io/backend/. 


359 


APPENDIXA INTRO TO KERAS 


Summary 


Keras is a great tool to help you easily get involved with creating, training, and testing deep 
learning models, and provides a great deal of functionality while abstracting away the 
complicated syntax that TensorFlow has. Keras by itself can be sufficient, but as the content 
gets more advanced, it's better to have the level of customization and flexibility that 
TensorFlow or PyTorch offers. Keras allows you to use a wide variety of functions through 
the back end, allowing you to write custom layers, custom models, metrics, loss functions, 
and so on, but for the most customization and flexibility in how you want your neural 
networks to be (especially if you want to make completely new types of neural networks), 
then either tf:keras + TensorFlow or PyTorch would be better suited for your needs. 


360 


APPENDIX B 


Intro to PyTorch 


In this appendix, you will be introduced to the PyTorch framework along with the 
functionality that it offers. PyTorch is more involved than Keras is, and it is a lower-level 
framework (meaning there’s more syntax, and elements aren’t abstracted away from you 
like in Keras). 

Regarding the setup, we use 


e Torch version 0.4.1 (PyTorch) 
e CUDA version 9.0.176 


e cuDNN version 7.3.0.29 


What Is PyTorch? 


PyTorch is a deep learning library for Python, developed by artificial-intelligence 
researchers at Facebook and based on the Torch library. While PyTorch is also a low- 
level language like TensorFlow, it is easier to pick up because of the huge difference in 
syntax. TensorFlow has a much steeper learning curve, and you have to define a lot more 
elements than in PyTorch. 

TensorFlow at the moment far surpasses PyTorch in how much community support 
it has, and this is primarily because PyTorch is a relatively new framework. Although you 
will find more resources for TensorFlow, more and more people are switching to PyTorch 
due to it being more intuitive while still offering practically the same functionality as 
TensorFlow (though TensorFlow does have some functions that PyTorch does not, 
you can easily implement those functions in PyTorch if you know what the logic is; an 
example of this is arctanh function). 

In the end, it is mostly a matter of personal preference when deciding to use 
TensorFlow or PyTorch. Depending on the context of your work, one framework might 
be more suitable than the other. 


361 
© Sridhar Alla, Suman Kalyan Adari 2019 


S. Alla and S. K. Adari, Beginning Anomaly Detection Using Python-Based Deep Learning, 
https://doi.org/10.1007/978-1-4842-5177-5 


APPENDIX B INTRO TO PYTORCH 


That being said, PyTorch might be easier to use for research purposes, considering 
that it is easier to prototype in due to the lessened burden from the syntax. On the other 
hand, TensorFlow has more resources and the advantage of having TensorBoard. It 
is also better suited for cross-platform compatibility, since a model can be trained in 
Python but deployed in Java, for example, allowing for better scalability. If loading and 
saving models is a priority, perhaps TensorFlow is more suitable. Again, it all comes 
down to personal preference, since there’s usually a workaround for many of the 
problems that both frameworks might face. 


Using PyTorch 


This section will be a bit different from the previous appendix. Here, we will demonstrate 
how some basic tensor operations are done, and then move on to illustrating how to 

use PyTorch by exploring PyTorch equivalent models of the temporal convolutional 
networks in Chapter 7. 

First, let’s begin by looking at some simple tensor operations. If you would like to 
know more about the framework itself and the functionality that it supports, check out 
the documentation at https: //pytorch.org/docs/0.4.1/index. html 
and the code implementation at https: //github.com/pytorch/pytorch. 

Let’s begin (see Figure B-1). 


362 


APPENDIX B INTRO TO PYTORCH 


In [53]: _ import torch 
2 import torch.nn 
3 import numpy as np 


a = np.random.randint(0, 10, 5) 
& torch. tensor (a) 
a 


Out[353): tensor([0, 3, 0, 9, 4], dtype=torch.int32) 

In [54]: 1 b = torch.tensor(np.random.randint(0, 10, 5)) 
2 b 

Out(54): tensor([3, 9, 7, 4, 2], dtype=torch.int32) 


In [55]: 1 ¢ = torch.add(a, b) 
¢ summed = torch.sum(c) 


4c, ¢_summed 
Out[55]): (tensor([{ 3, 12, 7, 13, 6], dtype=torch.int32), tensor (41) ) 
In [56]: 1 d = torch.tanh(b. float ()) 
d 


Out[(56): tensor([0.9951, 1.0000, 1.0000, 0.9993, 0.9640])) 


In [5S/]: | toerch.mean (d) 


Out([S7): tensor(0.9917) 


Figure B-1. A series of tensor operations in PyTorch. The code shows the operation 
and the output shows the results after the operations were performed on the 
corresponding tensors 


With PyTorch, you can see that the data values like the tensors are some sort of array, 
unlike in TensorFlow. In TensorFlow, you must run the variable through a session to be 
able to see the data values. 

In comparison, Figure B-2 shows TensorFlow. 


363 


APPENDIX B INTRO TO PYTORCH 


In [78]: import tensorflow as tf 
2 import numpy as np 


£f = tf.constant (np.random.randint(0, 10, 5), shape=(1, 5), dtype=tf.int32) 
print (f) 

g = tf.constant (np.random.randint(0, 10, 5), shape=(1, 5), dtype=tf.int32) 
print (g) 

result ~A+5B 

tanh = tf.tanh(tf.to float(result) ) 


with tf.Session() as sess: 
print("f: {} g: {}\n".format(sess.run(f), sess.run(g))) 
print("f£ + g: {}\n".format(sess.run(result) )) 
print ("tanh(ftg): {}".format(sess.run (tanh) ) ) 


Tensor ("Const _37:0", shape=(1, 5), dtype=int32) 
Tensor ("Const_38:0", shape=(1, 5), dtype=int32) 
£: ({5 6 9 7 0)] g: ((3 9 4 8 9)) 


£+gq: ([([14 16 6 5 16)]} 


tanh(f+gq): [[1. 1. 0.99998784 0.99990916 1. })) 


Figure B-2. Some tensor operations conducted in TensorFlow. Note that to 
actually see results, you need to pass everything through a TensorFlow session 


PyTorch has much more functionality in how you can manipulate tensors, so it’s 
worth checking out the documentation if you haven't. 

Now, let’s move on to creating a PyTorch model in a somewhat advanced, but 
organized format. Splitting up the definition of the model, the training process, and the 
testing process into their respective parts will help you understand how these models are 
created, trained, and evaluated. 

You start by applying a convolutional neural network to the MNIST data set in order 
to showcase the more customizable format of training. 

As usual, you begin with your imports (see Figure B-3 and Figure B-4). 


364 


import 
import 
import 
import 
import 
import 


import 


device 


Figure B-3. 


In [1]: i 


Figure B-4. 


APPENDIX B INTRO TO PYTORCH 


torch 

torch.nn as nn 

torchvision 

torchvision.transforms as transforms 
torch.optim as optim 
torch.nn.functional as F 


numpy as np 


= Corch.device|*cudaz:0" 2E torch.cuda.is ovailebplet) 
else 'cpu') 





Importing the basic modules needed to create your network 


import torch 


import torch.nn as nn 

import torchvision 

import torchvision.transforms as transforms 
import torch.optim as optim 


import torch.nn.functional as F 


import numpy as np 


device = torch.device("cuda:0" if torch.cuda.is available() else ‘cpu') 


The code in Figure B-3 in a Jupyter cell 


In Chapter 3, the code was introduced in a manner similar to basic Keras formatting, 


so you defined the hyperparameters and loaded your data sets (data loaders in this case) 


right after importing the modules you need. 


Instead, you will now define the model (see Figure B-5 and Figure B-6). 


365 


APPENDIX B INTRO TO PYTORCH 


class CNN(nn.Module): 
def init (self): 
SUper{CNN, self). anit {) 
self.convl = nn.Conv2d(1, 32, 3, 1) 
self.conv2 nn.Convzd(3z;, 64, 3, 1) 
self.densel Nnshineariiz*172*64, 122) 


self.dense2 nn. Linéear(12¢6, num classes) 


forward(self, x): 
X F.relu(self.convl (x) ) 


x F.relu(self.conv2 (x) ) 


Femax: poolZd(x;, zz 2) 


sGropout (x, 0.25) 
.view(-1l, 12*12*64) 
F.relu(self.densel (x) ) 
4 PeOropour (x, 045) 
x self.dense2 (x) 


return F.log softmax(x, dim=1) 





Figure B-5. Defining the model 


366 


In [2]: 


def 


as es 


APPENDIX B- INTRO TO PYTORCH 
class CNN(nn.Module): 
_ init _ (self): 
super(CNN, self). init  () 
self.convl = nn.Convzd(1, 32, 3, 1) 


nn.Conv2d(32, 64, 3, 1) 
nn.Linear(12*12*64, 128) 
nn.Linear(1i26, num classes) 


self.conv2 = 
selft.densel = 
self.densez = 


forward(self, x): 
F.relu(self.convl (x) ) 
F.relu(self.conv2 (x) ) 
F.max pool2d(x, 2, 2) 
F.dropout(x, 0.25) 
x.view(-1, i12*i2*64) 
F.relu(self.densel (x) ) 
F.dropout (x, 0.5) 
self.dense2 (x) 

return F.log softmax(x, dim=1) 


x - MS M MR OM ROM 


Figure B-6. The code in Figure B-5 in a Jupyter cell 


With that out of the way, you can define both the training and testing functions (see 


Figure B-7 and Figure B-8 for the training function, and Figure B-9 and Figure B-10 for 


the testing function). 


367 


APPENDIX B INTRO TO PYTORCH 


def train(model, device, train loader, criterion, optimizer, epoch, 
pave dir='model.ckpt*): 


cLotal step — leniirein loader) 


for 1; (images, labels) in enumerate (train loader): 


images = images.to(device) 


labels labels.to (device) 


# Forward pass 
outputs = model (images) 


loss = criterion(outputs, labels) 


# Backward and optimize 
Optimizer.zero grad () 
loss.backward () 


optimizer.step() 


if 100 == 0: 


prant ("Epoch [{}7t}il>y Seep [4 }7i}l, Loss: 
Lis4i} «formar (epochrl, mum epochs; atl, totel step, loss.i1tem() ):) 


Lorch. save (model .slate dictty, "pytorch mnist.cnhn.ckpr") 





Figure B-7. The training algorithm. The for loop takes each pair of image and 
labels and passes them into the GPU as a tensor. They then go into the model, and 
the gradients are calculated. The information about the epoch and loss are then 
output 


368 


In [6]: 


APPENDIX B INTRO TO PYTORCH 


def train(model, device, train_loader, eriterion, optimizer, epoch, save _dir="model.ckpt"): 
total step = len(train_loader) 
for i, {images, labels) in enumerate(train loader): 
images = images.to (device) 
labels labels.to (device) 


ot 


# Forvard pass 
outputs = model (images) 
loss = criterion(outputs, labels) 


# Backvard and optimize 
optimizer.zero_grad() 
loss.backward () 
optimizer.step() 
if (i+1) % 100 = 0: 
print ("Epoch [{}/{}], Step [{}/{}], Loss: {:.4£}' 


- format (epoch+l, num_epochs, i+l, total step, loss.item())) 


torch.save (model.state dict(), ‘pytorch mnist cnn.ckpt") 


Figure B-8. The code in Figure B-7 in a Jupyter cell 


The training function takes in the following parameters: 


model: An instance of a model class. In this case, it’s an instance of 
the CNN class defined above. 


device: This basically tells PyTorch what device (if the GPU is an 
option, which GPU to run on, and if not, the CPU is the device) to run 
on. In this case, you define the device right after the imports. 


train_loader: The loader for the training data set. In this case, you 
use a data_loader because that’s how the MNIST data is formatted 
when importing from torchvision. This data loader contains the 
training samples for the MNIST data set. 


criterion: The loss function to use. Define this before calling the train 


function. 


optimizer: The optimization function to use. Define this before 
calling the train function. 


epoch: What epoch is running. In this case, you call the training 
function in a for loop while passing in the iteration as the epoch. 


The testing function is shown in Figure B-9 and Figure B-10. 


369 


APPENDIX B INTRO TO PYTORCH 


from sklearn.metrics import roc auc_score 


def test(model, device, test loader): 


preds = [] 
y true = [1 
# Test the model 
model.eval () # Set model to evaluation mode. 
with torch.no grad(): 
correct = 0 
total = 0 
£or images; labels an test loader: 
images = images.to (device) 
labels labels.to (device) 
outputs = model (images) 
ue¢ Pedi cred = Lorcni.max (ourpucs.data, 1) 
total += labels.size(0O) 
correct += (predicted == labels) .sum().item() 
detached pred = predicted.detach().cpu() .numpy () 
detached label = labels..detach():.cpu() «numpy () 
for £ an range (0, len(decached. pred) ): 
preds.append(detached pred! f]) 


y true append (detached label. [rf] ) 


print ('Test Accuracy of the model on the 10000 test images: 
{:,2%}".format (correct / total) ) 


preds = fip.eye (mum classés) ([preds] 
y true = np.Gye(num classes) |y true] 
auc = boc auc score (preds, y true) 


Crip "AUC! (face ".roOrmat Laue) ) 





Figure B-9. The code for the testing algorithm. Once again, the for loop takes 
the image and label pairs and passes them through the model to get a prediction. 
Then, once every pair has a prediction, the AUC score is calculated 


370 


APPENDIX B INTRO TO PYTORCH 


In [7]: | from sklearn.metrics import roc auc score 
def test(model, device, test_loader): 
preds = [] 
y_true = [) 
# Test the model 
model.eval() # Set model to evaluation mode. 
with torch.no grad(): 
correct = 0 
tetal = 0 
for images, labels in test_loader: 
images = images.to(device) 
labels = labels.to(device) 
outputs = model (images) 
_, predicted = torch.max(outputs.data, 1) 
total += labels.size(0) 
correct += (predicted == labels) .sum().item() 
detached pred = predicted.detach().cpu() .numpy() 
detached label = labels.detach().cpu().numpy() 
for f in range(0, len(detached_pred)): 
preds.append (detached pred[f]) 
y_true.append(detached label/[f)) 
print ('Test Accuracy of the model on the 10000 test images: {:.2%}'.format(correct / total)) 
preds = np.eye(num_classes) [preds) 
y_true = np.eye(num_classes) [y true] 
auc = roc auc score(preds, y true) 
print("AUC: {:.2%}".format (auc)) 


Figure B-10. The code in Figure B-9 in a Jupyter cell 


Notice that you use the AUC score as part of the testing metric. You don’t have to do 
this, but it might be a better indicator of the model’s performance than plain accuracy, so 
it was included in this example. 

The parameters the model takes in are 


e model: An instance of a model class. In this case, it’s an instance of 
the CNN class defined above. 


e device: This basically tells PyTorch what device (if the GPU is an 
option, which GPU to run on, and if not, the CPU is the device) to run 


on. In this case, you define the device right after the imports. 


e test_loader: The loader for the testing data set. In this case, you use 
a data_loader because that’s how the MNIST data is formatted when 
importing from torchvision. This data loader contains the testing 
samples for the MNIST data set. 


Now you can get to defining your hyperparameters and data loaders, and calling 


your train and test functions (Figures B-11 through B-13). 


371 


APPENDIX B INTRO TO PYTORCH 


# Hyperparameters 
hum. epochs. = 15 

Wm classes = 1.0 
batch size = 128 


learning rate = 04001. 


# Load MNIST data set 
train dataset = torchvision.datasets.MNIST (root='../../data/', 


train=True, 


transform=transforms.ToTensor(), 


download=True) 


test dataset = torchvision.datasets.MNIST(root='../../data/', 


train=False, 


transform=transforms.ToTensor() ) 


# Data loader 
train toader — TOrch.tils.data.Datahoader (daetaset—train dataset, 
batch. eize=batch size, 


shuffle=True) 


Lest Loader = Torch. uri s.data,Dataloader(decasst=tese Catraser, 
batch size=batch size, 


shuffle=False) 





Figure B-11. Defining the hyperparameters, loading the MNIST data, and 
defining the training and testing set data loaders 


372 


APPENDIX B INTRO TO PYTORCH 


model = CNN().to(device) 


criterion = nn.CrossEntropyLoss () 


optimizer torch.optim.Adam(model.parameters(), lr=learning rate) 


## Training phase 


for epoch in range (0, num epochs): 


train(model, device, train loader, criterion, optimizer, epoch) 


## Testing phase 


test (model), device, test loader) 





Figure B-12. Initializing the model and passing it to the GPU, defining your 
criterion function (cross entropy loss), and defining your optimizer (the Adam 
optimizer). Then, the training and testing functions are called 


373 


APPENDIX B 


In [8]: 


r oO’ wn 


INTRO TO PYTORCH 


D> 


#Hyperparameters 
num_epochs = 15 
num_classes = 10 
batch_size = 128 
learning rate = 0.001 


#Load MNIST data set 

train dataset = torchvision.datasets.MNIST(root='../../data/', 
train=True, 
transform=transforms.ToTensor(), 
download=True) 


test_dataset = torchvision.datasets.MNIST(root='../../data/', 
train=False, 
transform=transforms.ToTensor()) 


#Data loader 

train loader = torch.utils.data.DataLoader (dataset=train dataset, 
batch_size=batch_size, 
shuffle=True) 


test_loader = torch.utils.data.DataLoader (dataset=test_dataset, 


batch_size=batch_size, 
shuffle=False) 


model = CNN().to(device) 
criterion = nn.CrossEntropyLoss() 
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate) 


## Training phase 


for epoch in range(0, num_epochs): 
train(model, device, train_loader, criterion, optimizer, epoch) 


## Testing phase 


test(model, device, test_loader) 


Figure B-13. What the code from Figures B-11 and B-12 should look like after 
pasting them into a Jupyter cell 


374 


Epoch 
Epoch 
Epoch 
Epoch 
Epoch 
Epoch 
Epoch 
Epoch 
Epoch 
Epoch 
Epoch 
Epoch 
Epoch 
Epoch 
Epoch 
Epoch 
Epoch 


[1/15], 
[1/15], 
[1/15], 
[1/15], 
[2/15], 
[2/15], 
[2/15], 
[2/15], 
[3/15], 
[3/15], 
[3/15], 
[3/15], 
[4/15], 
[4/15], 
[4/15], 
[4/15], 
[5/15], 


Epoch 
Epoch 
Epoch 
Epoch 
Epoch 


Dias = = 


[5/15], 
[5/15], 
[5/15], 
[6/15], 
[6/15], 


re fact 


step 
step 
Step 
step 
Step 
step 
otep 
step 
step 
step 
step 
step 
step 
otep 
Step 
step 
step 
step 
step 
Step 
step 
step 


Cc} == 


[100/469], 
[200/469], 
[300/469], 
[400/469], 
[100/469], 
[200/469], 
[300/469], 
[400/469], 
[100/469], 
[200/469], 
[300/469], 
[400/469], 
[100/469], 
[200/469], 
[300/469], 
[400/469], 
[100/469], 
[200/469], 
[300/469], 
[400/469], 
[100/469], 
[200/469], 


ronan faAcnt 


Loss: 
Loss: 
Loss: 
Loss: 
Loss: 
Loss: 
Loss: 
Loss: 
Loss: 
Loss: 
Loss: 
Loss: 
Loss: 
Loss: 
Loss: 
Loss: 
Loss: 
Loss: 
Loss: 
Loss: 
Loss: 
Loss: 


T = 2 


Figure B-14. The initial output of the training process 


Epoch [12/15], Step (200/464), Loss: 0.0017 
Epoch [12/15], Step [300/469], Loss: 0.0006 
Epoch [12/15], Step [400/469], Loss: 0.0026 
Epoch [12/15], Step [100/469], Loss: 0.0010 
Epoch [123/15], Step [200/469], Loss: 0.0002 
Epoch [13/15], Step [300/469], Loss: 0.0025 
Epoch [13/15], Step [400/469], Loss: 0.0014 
Epoch [14/15], Step [100/469], Loss: 0.0000 
Epoch [14/15], Step [200/469], Loss: 0.0000 
Epoch [14/15], Step [300/469], Loss: 0.0001 
Epoch [14/15], Step [400/469], Loss: 0.0064 
Epoch [15/15], Step [100/469], Loss: 0.0024 
Epoch [15/15], Step [200/469], Loss: 0.0000 
Fpoch [15/15], Step [300/469], Loss: 0.0017 
Epoch [15/15], Step [400/469], Loss: 0.0003 
Test Accuracy of the model on the 10000 test images: 95.93% 
AUC: 99.40% 


Figure B-15. The training process has finished 


APPENDIX B 


After the training process, you get Figure B-14 and Figure B-15. 


0.1852 
0.0556 
0.1524 
0.0332 
0.0494 
0.1003 
0.0534 
0.0329 
0.0387 
0.0379 
0.0055 
0.0183 
0.0283 
0.0250 
0.0210 
0.0637 
0.0109 
0.0091 
0.0243 
0.0095 
0.0310 
0.0254 


A Aan 


INTRO TO PYTORCH 


Although in your Keras examples you didn’t spread apart your training and testing 


functions (since they’re just one line each), more complicated implementations of 


models involving custom layers, models, and so on can be formatted in a similar fashion 


to the PyTorch example above. 


375 


APPENDIX B INTRO TO PYTORCH 


Hopefully, you understand a bit more on how to implement, train, and test neural 
networks in PyTorch. 

Next, we will explain some of the basic functionality that PyTorch offers in terms 
of model layers (activations included), loss functions, and optimizers, and then you'll 
explore PyTorch applications of temporal convolutional neural networks to the data set 


found in Chapter 7. 


Sequential vs. ModuleList 


Similar to Keras, PyTorch has a couple different ways to define the model. 


Sequentially, as in Figure B-16 


in i 3 ## sequential 
t 


import torch.nn as nn 


model = nn.Sequential ( 
nn.Conv2Zd(i, 32, 3, 1), 
nn.ReLuU(), 
nn.Conv2d(32, 64, 3, 1), 
nn.Sigmoid() 
) 


Figure B-16. A sequential model in PyTorch 


This is similar to the sequential model in Keras, where you add layers one at a time 
and in order. 
ModuleList, as in Figure B-17 


376 


APPENDIX B INTRO TO PYTORCH 


## ModuleList 


13 elass ModuleListModel (nn.Module): 

14 def init (self): 

is super (ModuleListModel, self). init  () 
1€ self.conv_1 = nn.Convzd(1, 32, 3, 1) 

17 self.conv_2 = nn.Convdd(32z, 64, 3, 1) 
1é self.dense 1 = nn.Linear(64*64, 128) 


self.output = nn.Linear(i28, n_classes) 


def forward(self, x): 

nn.functional.relu(self.conv 1(x)) 
nn.functional.relu(self.conv 2(x)) 
nn.functional.max pool2d(x, 2, 2) 
nn.functional.dropout(x, 0.25) 

x.View(-1, 64*64) 
nn.functional.reu(self.dense 1(x)) 
nn.functional.dropout(x, 0.5) 

nn.functional.log softmax(self.output(x), dim=1) 
return x 


nud wt wu 


x Me MM MOM OM OM 


| 


model = ModuleListModel () .to(dewice) 
Figure B-17. A model in PyTorch defined in a ModuleList format 


This is similar to the functional model that you can build in Keras. This is a more 
customizable way to build your model, and allows you much more flexibility in how you 
want to build it too. 


Layers 


We've covered how to build the models, so let’s look at examples of some common layers 
you can build. 


Convid 
torch.nn.Convid() 


Check out Chapter 7 for a detailed explanation on how one-dimensional 


convolutions work. 


377 


APPENDIX B INTRO TO PYTORCH 


This layer is a one-dimensional (or temporal) convolutional layer. It basically passes 
a filter over the one-dimensional input and multiplies the values element-wise to create 
the output feature map. 

These are the parameters that the function takes: 


e in_channels: The dimensionality of the input space; the number of 
input nodes. 


e out_channels: The dimensionality of the output space; the number 
of output nodes. 


e kernel_size: The dimensionality of the kernel/filter. An integer n 
makes the dimensions of the kernel nxn, and a tuple of two integers 
allows you to specify the exact dimensions (height, width). 


e stride: The number of elements to shift right by after one filter/ 
kernel operation. An integer n makes the kernel shift right by that 
amount. A tuple of two integers allows you to specify (vertical_shift, 
horizontal_shift). Default = 1. 


e padding: The amount of zero padding to add to the layer in the 
output. An integer n pads n entries to the rows and columns. A tuple 
of two integers allows you to specify (vertical_padding, horizontal 
padding). Default = 0. 


e dilation: For an explanation on how dilation works, refer to Chapter 7. 
An integer n means a dilation factor of n. Default = 1. 


e groups: Controls the connections between the input and output 
nodes. Groups=1 means all inputs correlate with all outputs. 
Groups=2 means there’s really two convolutional layers side by side, 
so half the inputs go to half the outputs. Default = 1. 


e bias: Whether or not to use bias. Default = True. 


Conv2d 
torch.nn.Conv2d() 


Check out Chapter 3 for a detailed explanation on how the 2D convolutional layer 


works. 


378 


APPENDIX B INTRO TO PYTORCH 


This layer is a two-dimensional convolutional layer. It basically passes a 2D filter over 


the input and multiplies the values element-wise to create the output feature map. 


These are the parameters that the function takes: 


in_channels: The dimensionality of the input space; the number of 
input nodes. 


out_channels: The dimensionality of the output space; the number 
of output nodes. 


kernel_size: The dimensionality of the kernel/filter. An integer n 
makes the dimensions of the kernel nxn, and a tuple of two integers 
allows you to specify the exact dimensions (height, width). 


stride: The number of elements to shift right by after one filter/ 
kernel operation. An integer n makes the kernel shift right by that 
amount. A tuple of two integers allows you to specify (vertical_shift, 
horizontal_shift). Default = 1. 


padding: The amount of zero padding to add to the layer in the 
output. An integer n pads n entries to the rows and columns. A tuple 


of two integers allows you to specify (vertical_padding, horizontal_ 


padding). Default = 0. 


dilation: For an explanation on how dilation works, refer to Chapter 7. 


An integer n means a dilation factor of n. A tuple of two integers 
allows you to specify (vertical_dilation, horizontal_dilation). 
Default = 1. 


groups: Controls the connections between the input and output 
nodes. Groups=1 means all inputs correlate with all outputs. 
Groups=2 means there’s really two convolutional layers side by side, 
so half the inputs go to half the outputs. Default = 1. 


bias: Whether or not to use bias. Default = True. 


379 


APPENDIX B INTRO TO PYTORCH 


Linear 
torch.nn.Linear() 


This is a neural network layer comprised of densely-connected neurons. Basically, 
every node in this layer is fully connected with the previous and next layers if there are any. 
Here are the parameters: 


e in_features: The size of each input sample; number of inputs. 
e out_features: The size of each output sample; number of outputs. 


e bias: Whether or not to use bias. Default = True. 


MaxPooling1D 
torch.nn.MaxPool1d() 


This applies max pooling on a 1D input. To get a better idea of how max pooling 
works, check out Chapter 3. Max pooling in 1D is similar to max pooling in 2D, except 
the sliding window only works in one dimension, going from left to right. 

This is the list of parameters: 


e kernel_size: The size of the pooling window. If an integer n is given, 
then the window size of the pooling layer is 1xn. 


e stride: Defaults to kernel_size if nothing is passed in. If you pass in 
an integer, the pooling window moves by integer n amount after 


completing its pooling operation on a set of entries. 


e padding: An integer n representing the zero padding to add on both 
sides. Default = 0. 


e dilation: Similar to the dilation factor in the convolutional layer, 
except with max pooling. Default = 1. 


e return_indices: If set to True, it will return the indices of the max 
values along with the outputs. Default = False. 


e ceil_mode: If set to True, it will use ceil instead of floor to compute 
the output shape. This comes into play because of the dimensionality 
reduction involved (a kernel size of n will reduce dimensionality by a 
factor of n). 


380 


APPENDIX B INTRO TO PYTORCH 


MaxPooling2D 
torch.nn.MaxPool2d() 


It applies max pooling on a 2D input. To get a better idea of how max pooling works, 
check out Chapter 3. 
This is the list of parameters: 


e kernel_size: The size of the pooling window. If an integer n is given, 
then the window size of the pooling layer is Lxn. A tuple of two 
integers allows you to specify the dimensions as (height, width). 


e stride: Defaults to kernel_size if nothing is passed in. If you pass in 
an integer, the pooling window moves by integer n amount after 
completing its pooling operation on a set of entries. A tuple of two 
integers allows you to specify (vertical_shift, horizontal_shift). 


e padding: An integer n representing the zero padding to add on both 
sides. A tuple of two integers allows you to specify (vertical_padding, 
horizontal_padding). Default = 0. 


e dilation: Similar to the dilation factor in the convolutional layer, 
except with max pooling. An integer n means a dilation factor of 
n. A tuple of two integers allows you to specify (vertical_dilation, 
horizontal_dilation). Default = 1. 


e return_indices: If set to True, it will return the indices of the max 
values along with the outputs. Default = False. 


e ceil_mode: If set to True, it will use ceil instead of floor to compute 
the output shape. This comes into play because of the dimensionality 
reduction involved (a kernel size of n will reduce dimensionality by a 
factor of n). 


ZeroPadding2D 
torch.nn.ZeroPad2d() 


Depending on the input, it pads the input sequence with a row and columns of 
zeroes at the top, left, right, and bottom of the image tensor. 


381 


APPENDIX B INTRO TO PYTORCH 


Here is the parameter: 


e padding: An integer or a tuple of four integers. The integer tells it 
to add n rows of zeroes on the top and bottom of the image tensor, 
and n columns of zeroes. The tuple of four integers is formatted as 
(padding left, padding right, padding top, padding bottom), so 
you can customize even more how you want the layer to add rows or 


columns of zeroes. 


Dropout 
torch.nn.Dropout () 


What the dropout layer does in PyTorch is take the input and randomly zeroes the 
elements according to some probability p using samples from a Bernoulli distribution. 
This process is random, so with every forward pass through the model, different 
elements will be chosen to be zeroed. This process helps with regularization of layer 
outputs and helps combat overfitting. 

Here are the parameters: 


e p: The probability of an element to be zeroed. Default = 0.5 


e inplace: If set to True, it will perform the operation in place. 
Default = False. 


You can define this as a layer within the model itself, or apply dropout in the forward 


function like so: 
torch.nn.functional.Dropout(input, p = 0.5, training=False, inplace=False) 


Input is the previous layer, and training is a parameter that determines whether or not 
you want this dropout layer to function outside of training (such as during evaluation). 
Figure B-18 shows an example of how you can use this layer in the forward function. 


In [ ]: | def forward(self, x): 
: x = nn.functional.relu(self.conv_1(x)) 
4 
zx 


nn.functional.dropout(x, 0.25) 
nn.functional.relu(self.conv_2(x)) 


i ot 


Figure B-18. The dropout layer in the forward function of a model 


382 


APPENDIX B INTRO TO PYTORCH 


So with dropout, you have two ways of applying it, both producing similar outputs. In 
fact, the layer itself is an extension of the functional version of dropout, which itself is an 
interface. This is really up to personal preference, since both are still dropout layers and 
there’s no real difference in behavior. 


ReLU 
torch.nn.ReLU() 


ReLU, or “Rectified Linear Unit’, performs a simple activation based on the function, 
as shown in Figure B-19. 


f(x) = max(0, x) 


Figure B-19. The general formula that ReLU follows 


Here is the parameter: 


e inplace: If set to True, it will perform the operation in place. 
Default = False. 
For ReLU, the graph can look like Figure B-20. 






y=xforallx>0O 


y=Oforallx<0O 


Figure B-20. The general graph of a ReLU function 


383 


APPENDIX B INTRO TO PYTORCH 


Similarly to dropout, you can define this as a layer within the model itself, or apply 
ReLU in the forward function like so: 


torch.nn.functional.relu(input, inplace=False) 


Input is the previous layer. 

Figure B-21 shows an example of how you can use this layer in the forward function. 

Just like with dropout, you have two ways of applying ReLU, but it all boils down to 
personal preference. 


In [ ]: | def forward(self, x): 

x = nn.functional.relu(self.conv_1(x)) 
x = nn.functional.dropout(x, 0.25) 
x = nn.functional.relu(self.conv_2(x)) 


Figure B-21. The ReLU layer in the forward function of a model 


softmax 
torch.nn.Softmax() 


This performs a softmax on the given dimension. 
The general formula for softmax is shown in Figure B-22 (K is the number of 
samples). 
evi 
= ———EEES i-— = k 
a(x); T Oe, For i = 1,...,K and x = (x, ..., X,) E R 


Figure B-22. The general formula for softmax. The parameter i goes up until the 
total number of samples, which is K 


Here is the parameter: 


e dim: The dimension to compute softmax along, determined by some 
integer n. This is so every slice along the dimension will sum to 1. 
Default = None. 


You can define this as a layer within the model itself, or apply softmax in the forward 
function like so: 


torch.nn.functional.softmax(input, dim=None, _stacklevel=3) 


384 


APPENDIX B INTRO TO PYTORCH 


Input is the previous layer. 
Figure B-23 shows an example of how you can use this layer in the forward function. 


In [ ]: def forward(self, x): 
x = nn.functional.dropout(x, 0.5) 


x = nn.functional.softmax(self.dense 1(x), dim=1) 
return x 


Figure B-23. The softmax layer in the forward function of a model 


However, this doesn’t work well if you’re using NLLL (negative log likelihood) loss, in 
which case you should use log_softmax instead. 


Log Softmax 
torch.nn.LogSoftmax() 


This performs a softmax activation on the given dimension, but passes that through 
a log function. 
The general formula for log_softmax is shown in Figure B-24 (K is the number of 
samples). 
a 


a(x); = log (=| For i = 1,...,K and x = (x, ..., X,) € R* 


jare 


Figure B-24. The general formula for log_softmax. The value i goes up until the 
total number of samples, K. 


Here is the parameter: 


e dim: The dimension to compute softmax along, determined by some 
integer n. This is so every slice along the dimension will sum to 1. 
Default = None. 
You can define this as a layer within the model itself, or apply softmax in the forward 


function like so: 
torch.nn.functional.log softmax(input, dim=None, _stacklevel=3) 


Input is the previous layer. 


385 


APPENDIX B INTRO TO PYTORCH 


Figure B-25 shows an example of how you can use this layer in the forward function. 


in. [33 1 def forward(self, x): 
x = nn.functional.dropout(x, 0.5) 


x = nn.functional.log softmax(self.dense_1(x), dim=1) 
return x 


Figure B-25. The log softmax layer in the forward function of a model 
Sigmoid 
torch.nn.Sigmoid() 


This performs a sigmoid activation. 

The sigmoid function does have its uses, primarily because it forces the input to be 
between 0 and 1, but it is prone to the vanishing gradient problem, and so it is seldom 
used in hidden layers. 

There are no parameters, so it’s a simple function to call. 

To get an idea of what the equation is like when graphed, refer to Figure B-26. 





Figure B-26. The general graph of a sigmoid function 


386 


APPENDIX B INTRO TO PYTORCH 


You can define this as a layer within the model itself, or apply sigmoid in the forward 


function like so: 
torch.nn. functional.sigmoid(input) 


Input is the previous layer. 
Figure B-27 shows an example of how you can use this layer in the forward function. 


In [ ]: def forward(self, x): 
x = nn.functional.dropout(x, 0.5) 


x = nn.functional.sigmoid(self.dense 1(x), dim=1) 
return x 


Figure B-27. The sigmoid layer in the forward function of a model 


Loss Functions 
MSE 


torch.nn.MSELoss() 


If you have questions on the notation for this equation, refer to Chapter 3. The 
equation is shown in Figure B-28. 


1s | | 
J(@) = — » (ho(x!) - y')? 


Figure B-28. The general formula for mean squared loss 


Given input 0, the weights, the formula finds the average difference squared between 
the predicted value and the actual value. The parameter h, represents the model with the 
weight parameter 0 passed in, so h,(x') would give the predicted value for x’ with model’s 
weights @. The parameter y’ represents the actual prediction for the data point at index i. 
Lastly, there are n entries in total. 

This function has several parameters (two are deprecated): 


e size_average: (Deprecated in favor of reduction.) The losses are 
averaged over each loss element in the batch by default (True). If 
set to False, then the losses are summed for each minibatch instead. 
Default = True. 


387 


APPENDIX B INTRO TO PYTORCH 


e reduce: (Deprecated in favor of reduction.) The losses are averaged 
or summed over observations for each minibatch depending on 
size_average by default (True). If set to False, then it returns a loss per 
batch element and ignores size_average. Default = True. 


e reduction: A string value to specify the type of reduction to be done. 
Choose between ‘none; ‘elementwise_mean, or ‘sum: ‘none’ means 
no reduction is applied, ‘elementwise_mean’ will divide the sum of 
the output by the number of elements in the output, and ‘sum’ will 
just sum the output. Default = ’elementwise_mean‘ Note: specifying 


either size_average or reduce will override this parameter. 


This loss metric can be used in autoencoders to help evaluate the difference between 
the reconstructed output and the original. In the case of anomaly detection, this metric 
can be used to separate the anomalies from the normal data points, since anomalies 


have a higher reconstruction error. 


Cross Entropy 
torch.nn.CrossEntropyLoss() 


The equation is shown in Figure B-29. 


1 Tl 
J(@) = = —)_ y,* log hex) + (1 = y,) * log(1 = h(x) 
i=0 


Figure B-29. The general formula for cross entropy loss 


In this case, n is the number of samples in the whole data set. The parameter hy, 
represents the model with the weight parameter 0 passed in, so h,(x;) would give the 
predicted value for x; with model’s weights 0. Finally, y; represents the true labels for 
data point at index i. The data needs to be regularized to be between 0 and 1, so for 
categorical cross entropy, it must be piped through a softmax activation layer. 

The categorical cross entropy loss is also called softmax loss. 

Equivalently, you can write the previous equation as Figure B-30. 


388 


APPENDIX B INTRO TO PYTORCH 


r 


1 | rl 
J(@) =- -) » Viz * log (No (Xi) 


i=0 j=0 


Figure B-30. An alternate way to write the equation in Figure B-29 


In this case, m is the number of classes. 
The categorical cross entropy loss isa commonly used metric in classification tasks, 
especially in computer vision with convolutional neural networks. 


This function has several parameters (two are deprecated): 


e weight: (Optional) A tensor that’s the size of the number of classes 
n. This is essentially a weight given to each class so that some classes 
are weighted more heavily in how they affect the overall loss and 


optimization problem. 


e size_average: (Deprecated in favor of reduction.) The losses are 
averaged over each loss element in the batch by default (True). If 
set to False, then the losses are summed for each minibatch instead. 
Default = True. 


e ignore_index: (Optional) An integer that specifies a target value 
that is ignored so it does not contribute to the input gradient. If 
size_average is True, then the loss is averaged over targets that aren’t 
ignored. 


e reduce: (Deprecated in favor of reduction.) The losses are averaged 
or summed over observations for each minibatch depending on size_ 
average by default (True). If set to False, it returns a loss per batch 
element and ignores size_average. Default = True. 


e reduction: A string value to specify the type of reduction to be done. 
Choose between ‘none; ‘elementwise_mean; or ‘sum: ‘none’ means 
no reduction is applied, ‘elementwise_mean’ will divide the sum of 
the output by the number of elements in the output, and ‘sum’ will 
just sum the output. Default = ’elementwise_mean‘ Note: Specifying 


either size_average or reduce will override this parameter. 


389 


APPENDIX B INTRO TO PYTORCH 
Optimizers 
SGD 


torch.optim.SGD() 


This is the stochastic gradient descent optimizer, a type of algorithm that aids in 
the backpropagation process by adjust the weights. It is commonly used as a training 
algorithm in a variety of machine learning applications, including neural networks. 


This function has several parameters: 


e params: Some iterable of parameters to optimize, or dictionaries with 


parameter groups. This can be something like model. parameters (). 
e Ir: A float value specifying the learning rate. 


¢ momentum: (Optional) Some float value specifying the momentum 
factor. This parameter helps accelerate the optimization steps in the 
direction of the optimization, and helps reduce oscillations when 
the local minimum is overshot (refer to Chapter 3 to refresh your 
understanding on how a loss function is optimized). Default = 0. 


e weight_decay: A 12_penalty for weights that are too high, helping 
incentivize smaller model weights. Default = 0. 


e dampening: The dampening factor for momentum. Default = 0. 


e nesterov: A Boolean value to determine whether or not to apply 
Nesterov momentum. Nesterov momentum is a variation of 
momentum where the gradient is computed not from the current 
position, but from a position that takes into account the momentum. 
This is because the gradient always points in the right direction, 
but the momentum might carry the position too far forward and 
overshoot. Since this doesn’t use the current position but instead 
some intermediate position that takes into account momentum, the 
gradient from that position can help correct the current course so 
that the momentum doesn’t carry the new weights too far forward. 
It essentially helps for more accurate weight updates and helps 
converge faster. Default = False. 


390 


APPENDIX B INTRO TO PYTORCH 


Adam 
torch. optim. Adam() 


The Adam optimizer is an algorithm that extends upon SGD. It has grown quite 
popular in deep learning applications in computer vision and in natural language 
processing. 


This function has several parameters: 


e params: Some iterable of parameters to optimize, or dictionaries 
with parameter groups. This can be something like model. 
parameters(). 


e Ir: A float value specifying the learning rate. Default = 0.001 (or le-3). 


e betas: (Optional) A tuple of two floats to define the beta values 
beta_1 and beta_2. The paper describes good results with (0.9, 0.999) 
respectively, which is also the default value. 


e eps: (Optional). Some float value where epsilon e >= 0. Epsilon is 
some small number, described as 10E-8 in the paper, to help prevent 
division by 0. Default is 1e-8. 


e weight_decay: A 12_penalty for weights that are too high, helping 
incentivize smaller model weights. Default = 0. 


e amsgrad: A Boolean on whether or not to apply the AMSGrad 
version of this algorithm. For more details on the implementation 
of this algorithm, check out “On the Convergence of Adam and 
Beyond.” Default=False. 


RMSProp 


torch.optim.RMSprop() 


RMSprop is a good algorithm for recurrent neural networks. RMSprop is a gradient- 
based optimization technique developed to help address the problem of gradients 
becoming too large or too small. RMSprop helps combat this problem by normalizing 
the gradient itself using the average of the squared gradients. In Chapter 7, it’s explained 
that one of the problems with RNNs is the vanishing/exploding gradient problem, 


391 


APPENDIX B INTRO TO PYTORCH 


leading to the development of LSTMs and GRU networks. And so it’s of no surprise that 
RMSprop pairs well with recurrent neural networks. 
This function has several parameters: 


e params: Some iterable of parameters to optimize, or dictionaries 
with parameter groups. This can be something like model. 
parameters(). 


e Ir: A float value specifying the learning rate. Default = 0.01 (or le-2). 


¢ momentum: (Optional). Some float value specifying the momentum 
factor. This parameter helps accelerate the optimization steps in the 
direction of the optimization, and helps reduce oscillations when 
the local minimum is overshot (refer to Chapter 3 to refresh your 
understanding on how a loss function is optimized). Default = 0. 


e alpha: (Optional) A smoothing constant. Default = 0.99 


e eps: (Optional). Some float value where epsilon e >= 0. Epsilon is 
some small number, described as 10E-8 in the paper, to help prevent 
division by 0. Default is 1e-8. 


e centered: (Optional) If True, then compute the centered RMSprop 
and have the gradient normalized by an estimation of its variance. 
Default = False. 


e weight_decay: An |2_penalty for weights that are too high, helping 
incentivize smaller model weights. Default = 0. 


Hopefully by now you understand how PyTorch works by looking at some of the 
functionality that it offers. You built and applied a model to the MNIST data set in an 
organized format, and you looked at some of the basics of PyTorch by learning about the 
layers, how models are constructed, how activations are performed, and what the loss 


functions and optimizers are. 


Temporal Convolutional Network in PyTorch 


Now, you will look at an example of using PyTorch to construct a temporal convolutional 
network and apply it to the credit card data set from Chapter 7. 


392 


APPENDIX B INTRO TO PYTORCH 


Dilated Temporal Convolutional Network 


The particular TCN you will reconstruct in PyTorch is the dilated TCN in Chapter 7. 
Once again, you begin with your imports and define your device (Figure B-31 and 
Figure B-32). 


import numpy as np 

import pandas as pd 

import keras 

from sklearn.model selection import train test split 
from sklearn.preprocessing.data import StandardsScaler 
import torch 

import torch.nn as nn 


import torch.nn.functional as F 


# Hyperparameters 


num epochs = 30 


num classes = 2 


Learning 2ars = 0,002 


device = torch.déevice('cuda:0" if torch.cuda.is available({) else 
‘cpu ) 





Figure B-31. Importing the necessary modules 


In [14]: 1 import numpy as np 

2 import pandas as pd 
from sklearn.model selection import train_test_split 
from sklearn.preprocessing.data import StandardScaler 
import torch 
import torch.nn as nn 
import torch.nn.functional as F 


¥ Hyperparameters 
num epochs = 30 
li num _classes = 2 


learning rate = 0.002 


device = torch.device('cuda:0" if torch.cuda.is available() else 'cpu') 


Figure B-32. The code in Figure B-31 in a Jupyter cell 
393 


APPENDIX B INTRO TO PYTORCH 


Next, you load your data set (Figure B-33). 


= pd.read csv ("datasets/creditcardfraud/creditcard.csv", 
sep=",", index col=None) 


print (df.shape) 


df.head() 


Figure B-33. Loading your data set and displaying the first five rows 


The output should look somewhat like Figure B-34. 


In [2]: 1 df = pd.read_cav("datasets/creditcardfraud/creditcard. 


print (df.shape) 
df.head ()} 


(264807, 


Out [2]: 
Time 


o 0.0 
0.0 


= 


1.0 
1.0 


& tt hi 


2.0 


31) 


Vi V2 
“1.359807 -0.072781 
1.191857 0.266151 
“1.358354 = -1.340163 
“0.966272 -0.185226 
“1.158233 O.B7 7737 


3 rows * 31 columns 


< 


V3 
2.536347 
0.166480 
1.7 F3209 
1.792993 
1.548718 


V4 
1.378155 
0.448154 


0.379780 


0.863291 


0.403034 


V5 


0.338321 


0.060018 


-0.503198 
-0,010309 
=0. 4077193 


V6 


0.462388 


-0.082361 


1.800499 
1.247203 
0.095921 


Figure B-34. The output of the code in Figure B-33 





csv", sep=",", index_col=None) 


v7 
0.239599 
-0.078803 
0.791461 
0.237609 
0.592941 


Ve 
0.098698 
0.085102 
0.247676 
0.377436 


“0.270533 


0363787 |. 
“0.255425... 
“1.514654. 
“1.387024... 

0.817739 


Vv. 
~0.01830 
-0.2257° 

0.24798 
“0.10830 
“0.0094. 


You need to standardize the values for Time and for Amount since they can get large. 


Everything else has already been standardized in the data set. Run the code in Figure B-35. 


daf['Amount' 


] = 


StandardScaler() .fit_transform(df['Amount'].values.reshape(-1, 1)) 


df['Time' ] 


StandardScaler() .fit_transform(df['Time'].values.reshape(-1, 1)) 


df .tail() 


Figure B-35. Standardizing the values in the columns Amount and Time 


394 





The output should look somewhat like Figure B-36. 


[3]: 


APPENDIX B 


df[*Time’) = StandardScaler().fit_transform(df['Time'].values.reshape(-1, 1i)) 
df.tail() 


Out [3]: 


284802 
284803 
284804 
284805 
284806 


Time 
1.641931 
1.641952 
1.641974 
1.641974 
1.642058 


v1 
-11.881118 
-0.732789 
1.919565 
-0.240440 
-0.533413 


5 rows x 31 columns 


< 


v2 
10.071785 
-0.055080 
-0.301254 
0.530483 
-0.189733 


V3 
-9.834783 
2.035030 
-3.249640 
0.702510 
0.703337 


v4 
-2.066656 
-0.738589 
-0.557828 
0.689799 
-0.506271 


V5 
-5.364473 
0.868229 
2.630515 
-0.377961 
-0.012546 


Figure B-36. The output of the code in figure B-35. 


Now you define your normal and anomaly data sets (see Figure B-37. 


anomalies 


normal = 


anomalies.shape, 


= df[dft["Class"] 


arilari™Class"™ |] 


== 0] 


normal.shape 


V6 
-2.606837 
1.058415 
3.031260 
0.623708 
-0.649617 


Figure B-37. Defining the anomaly and normal data sets 


The output should look like Figure B-38. 


In [4]: 


Out [4]: 


= 


anomalies = 
normal = df[df["Class"] == 0] 


df(df("class"] = 1) 


anomalies.shape, normal.shape 


((492, SL), 


(284315, 31)) 


Figure B-38. The output of the code in Figure B-37 


v7 
-4.918215 
0.024330 
-0.296827 
-0.686180 
1.577006 


v8 
7.305334 
0.294869 
0.708417 
0.679145 
-0.414650 


INTRO TO PYTORCH 


df(*Amount'] = StandardScaler().fit_transform(df["*Amount']).values.reshape(-1, 1)) 


1.91. 
0.58. 
0.43: 
0.39: 
0.481 





395 


APPENDIX B INTRO TO PYTORCH 


After isolating the anomalies from the normal data, let’s create your training and 
testing sets (see Figure B-39. 


for f in range(0, 20): 


normal = normal.iloc[np.random.permutation(len(normal) ) ] 


data set = pd.concat ([normal[:10000], anomalies] ) 


x train, xX test. = train test split(data set, test size = 0.4; 
random state = 42) 


x train = x train.sort values (by=—([*Time* |) 


x test = x test.so0rt values (by=(7 Time") ) 


y train = x train "Class” | 


Vy test. = x test "Class” | 


x train.head(10) 





Figure B-39. The creation of the training and testing data sets 


396 


The output should look somewhat like Figure B-40. 


In [5]: 


Out [5]: 


118 
177 
225 
259 
356 
374 
379 
400 


for f in range(0, 20): 


normal = normal.iloc[np. random. permutation (len(normal) ) ] 


data set = pd.concat([normal[:10000], anomalies)) 


APPENDIX B 


INTRO TO PYTORCH 


x train, x test = train _test_split(data_set, test size = 0.4, random state = 42) 


x train = x train.sort values (by=['Time')) 


x test = x test.sort values (by=['Time']) 


y_ train = x train["Class"] 


y_test = x_test[("Class"] 


x train.head(i0) 


Time 
-1.996436 
-1.995151 
-1.994983 
-1.994182 
-1.993488 
-1.992729 
-1.991087 
-1.990834 
-1.990729 
-1.990476 


V1 
-0.894286 
1.232996 
-0.997176 
1.194066 
-2.687976 
0.726749 
1.260328 
1.124355 
-1.092301 
-0.695818 


10 rows x 31 columns 


V2 
0.286157 
0.189454 
0.228365 

-0.072582 
4.390230 
-0.528042 
0.299161 
-0.132953 
0.430750 
0.581773 


V3 
-0.113192 
0.491040 
1.715340 
0.635286 
-2.360483 
0.050366 
0.527681 
0.583467 
1.249785 
2.378180 


v4 
“0.271526 
0.633673 
-0.420067 
0.768616 
0.360829 
1.373621 
0.614899 
0.804871 
0.429757 
0.063396 


V5 
2.669599 
-0.511574 
0.560838 
-0.735534 
1.310192 
-0.124122 
-0.420592 
-0.726266 
1.272076 
0.329119 


Figure B-40. The output of the code in Figure B-39 


V6 
3.721818 
-0.990609 
0.564725 
-0.673466 
-1.645253 
0.415688 
-0.977533 
-0.521875 
0.548203 
-0.449865 


V7 
0.370145 
0.066240 
0.846047 

-0.146299 
2.327776 
0.259555 
0.108485 

-0.167010 

-0.120592 
1.269104 


v8 
0.851084 
-0.196940 
0.197491 
-0.065653 


=1.727825 


0.085114 
-0.244502 
0.059298 
0.452571 
-0.758363 


-0.392 
0.075 
-0.097 
0.64€ 
4.324 
-0.002 
-0.05€ 
0.368 
-0.414 
0.381 


397 


APPENDIX B INTRO TO PYTORCH 


After defining your data sets, you need to reshape the values so that your neural 


network can accept them (see Figure B-41). 





x Creal = lp.attay(* train) <reshape(x train.shape(C]y Ly 
x train.shape[1]) 


x. LESt = Dpsatray (x Lest) .«reshape(x vtesc.shape!0], 1, 
x test.s.shape|t]) 


¥Y Crain = Np.array(y train) «reshape (y train.shape (0). -, 1) 


y [est = fp.dtray ly tes.) «restapely vtesc.shape!0), 1) 


fe) 


print ("Shapes:\nx train:ss\ny train:ss\n" % (x train.shape, 
y train.shape) ) 


fe) 


print ("x test:ss\ny test:%s\n" % (x test.shape, 
y test.shape) ) 


Figure B-41. Reshaping the training and testing data sets so you can pass them 
into the model 


The output should look like Figure B-42. 


In [6]: | x_train = np.array(x_train).reshape(x train.shape(0], 1, x_train.shape[i)) 
x test = np.array(x_test).reshape(x test.shape[(0)], 1, x_test.shape[i]) 


y_train = np.array(y train).reshape(y train.shape[(0] , 1) 
y_test = np.array(y test).reshape(y test.shape[0], 1) 


print ("Shapes:\nx_train:%s\ny train:%s\n" % (x_train.shape, y train.shape)) 
print ("x _test:%s\ny test:%s\n" % (x_test.shape, y test.shape) ) 


— 


Shapes: 
x_train: (6295, 1, 31) 
y_train: (6295, 1) 


x test: (4197, 1, 31) 
y_test: (4197, 1) 


Figure B-42. The output of the code in Figure B-41 


398 


APPENDIX B INTRO TO PYTORCH 


Now you can define your model (Figure B-43 and Figure B-44). 


class TCN(nn.Module): 
deE init. (seit): 


Super (TCN, Selt}..1nmit_ 4) 


selt.cony 1 = sCOnVviCtl, T2e, kernel. si126=2, cdiberaon=L, 
padding=((2-1) * 1)) 


selt-convy. 2 = “Convilda (le, i2e;, kernel Si26—-2, ciletion=Z, 
padding=((2-1) * 2)) 


sell .cony..2 = sConvid( 126, 126, Kernel Size=2; dilation=4, 
padding=((2-1) * 4)) 


self.conv 4 = “Convila (le, 1.26; Kernel Size-2, diletion=3, 
padding=((2-1) * 8)) 


selrsdense: | nn.Linear(31*128 , 128) 


selredense -2 in. Linear(173, num classes) 





Figure B-43. The first part of the TCN class 


399 


APPENDIX B INTRO TO PYTORCH 


def forward(self, xX): 

x selt.convy 1 (x) 

pS Kt, ty 2>eebLt.cony. J.pedding |) 0) ] 
F.relu(x) 
F,dropout(x, ‘0.05) 
seltT.conv 2{xX) 
ALi, 2% t>-Selivcony 2.pedding (0) 
F.relu (x) 
EF ,dropout (x) 0.05) 
Sselr.cony os) 
Kty tt, 2>-Selrsceny 3.pecding]0)] 
Fi. relia (x) 
F,dropout (sx, 04.05) 
self.conv_ 4 (x) 
Xt, tt, Poselr.convy 4.paedding [0] ] 
F.relu (x) 
F,dropoul (x,. 0,05) 
X.view(-1, 31%*128) 

x Patelu(selt«dernse. 1 (x) 


x selrt..dense 2 (x) 


return F.log softmax(x, dim=1) 





Figure B-44. The forward function in the TCN class 


The code for the model should look like Figure B-45. 


400 


APPENDIX B INTRO TO PYTORCH 


In [is]: i elass TCN(nn.Module): 
2 def init (self): 
super(TCN, self). init ({) 


self.conv_1l = nn.Convid(1, 128, kernel_size=2, dilation=1, padding=((2-1) * 1)) 
self.conv_2 = nn.Convld(128, 128, kernel_size=2Z, dilation=2, padding=((2-1) * 2)) 
seli.conv_3 = nn.Convld(i28, 128, kernel_size=z, dilation=4, padding=((2-1) * 4)) 
seli.cenv_4 = mn.Cenvid(iz8, i20, kernel_size=z, dilation=¢, padding=((2-1) * &)) 
self.dense_1 = nn.binear(3i*128 , 128) 

seli.dense 2 = nn.binear(128, num_classes) 


def forward (self, x): 

self.conv_1(x) 

x[t, #, :-self.conv_1.padding[0] ) 
F.relu(x) - 

F.dropout(x, 0.05) 

self.conv_2(x) 

x[:, 3, :-selfi.conv_2.padding[0]] 
F.relu (x) 

F.dropout(x, 0.05) 

self.conv_3(x) 

x(t, =, :-sel£.conv_3.padding[0)] 
F.relu(x) 

F.dropout(x, 0.05) 

self.conv_4 (x) 

x[:, i, :-self.conv_4.padding[®)] 
»Félu (x) 

-dropout (x, 0.05) 

-View(-1, 31*1268) 
-relu(self.dense_1(x)) 
self.dense_ 2 (x) 

return F.log_softmax(x, dim=1) 


ry My 


my 


Hnnkntnhnnt ht tt tt t tt tow tt 


x x xX RR eR KR KE MN E KE RM RR OM BM 


Figure B-45. The code from Figures B-43 and B-44 in a Jupyter cell. This defines 
the entire model 


Now you can define your training and testing functions (Figure B-46, Figure B-48, 
and Figure B-49). 


401 


APPENDIX B INTRO TO PYTORCH 


def trainimodel, device, x train, y train; Criterion, Optimizer, 


epoch, Save dir="TCN CredivtCard PyTorch.ckpt"): 


Lola, Sven = len(x treatin) 


x train Lorch. Tensor (= 1rain).cuda().tloavt) 


y train Lorch. Tensorly train) scuda() «long () 


x Utrdin. to (device) 


y-Uraim. bo(device) 


# Forward pass 
outputs = model (x train) 


loss = Criverion(culpucs, y train.squeeze(])) 


# Backward and optimize 
optimizer .zero grad () 
loss.backward () 


optimizer.step() 


print ('Epoch {}/{};, lioss: {:.4£}" .format (epoch+1, 
num epochs, loss.item())) 


LOrCh. save (model .slalre: dicr(), save dir) 





Figure B-46. The training function. Since you don’t have data loaders, you pass 
in x_train and y_train directly into the GPU after converting them to tensors. 
The inputs then pass through, and the gradients are calculated. 


402 


APPENDIX B INTRO TO PYTORCH 


In (12): 1 def train(model, device, x train, y train, criterion, optimizer, epoch, save dir="TCN CreditCard PyTorch.ckpt'): 


2 tal_step = len(x_train) 


x train = torch.Tensor(x_train) .cuda() .float{) 
) y_train = torch.Tensor(y train) .cuda().long({) 


x_train.to (device) 
y_train.to (device) 


# Forvard pass 
outputs = model(x_ train) 


12 loss = criterion(ocutputs, y_train.aqueeze (1}} 


i4 # Backvard and optimize 


Optimizer.zero grad() 

16 loss. backward () 

optimizer. step () 

i9 print("Epoch {}/{}, Loss: {:.4£}'.format(epoch+1, num_epochs, loss.item())) 


terch.save(model.state dict(), save _dir) 


Figure B-47. The code from Figure B-46 in a Jupyter cell 


403 


APPENDIX B INTRO TO PYTORCH 


from sklearn.metrics import roc auc _ score 


def test(model, device, x test, y test): 
preds = [] 


y true = [1 


# Set model to evaluation mode. 
model.eval () 
with torch.no grad(): 

correct = 0 


total = 0 


x test torch. Tensor(x test).cuda()-float() 


y_test Lorch. Tensor (y tesc).«cuda()..long() 


x LESt x TEst.co (cevice) 

y tese y_test.to (device) 

y test y test .squeeze (1) 

outputs = model (x test) 

_» predicted = tCorch.max(outputs.data, 1) 

total. += yy Test.517610) 

COPrect += (predicted == -y test) ssum() <item) 
detached pred = predicted.detach().cpu() .numpy () 


delached. label. = -y Test .detach() «cpu () snumpy() 





Figure B-48. The testing function. Since there are no data loaders, the testing sets 
must be converted into a tensor and passed into a GPU before being able to make 
predictions on them. The AUC score is then generated along with an accuracy 
value 


The rest of the testing function code is shown in Figure B-49. 


404 


APPENDIX B INTRO TO PYTORCH 


for f in range(0, len(detached label)): 
preds.append (detached pred) Tf |) 


y true,.append (detached label |[f|) 


print ('Test Accuracy of the model on the 10000 
test images: {:.2%}'.format(correct / total)) 


preds = np.eye (num classes) |[preds | 


y Urue = np.eye (num Classes). ly true! 
auc = TOC auc: Score (np.~round(preds), y true) 


print.("AUC? {reZo}"stormat. (auc) 





Figure B-49. The rest of the testing function. This deals with calculating the AUC 
score and accuracy value 


The entire test function should look like Figure B-50. 


In [218]: i from sklearn.metrics import roc_auc_score 


3 def test(model, device, x_test, y test): 
preds = [] 
y_true = [] 


# Set model to evaluation mode. 
model.eval() 
with torch.no grad(): 

correct = 0 

total = 0 


x_test = torch.Tensor(x test) .cuda().float() 
y_test = torch.Tensor(y test) .cuda().long() 


x_test = x_test.to(device) 

y_test = y test.to(device) 

y_test = y test.squeeze(1) 

outputs = model(x test) 

_- predicted = torch.max(outputs.data, 1) 

total += y test.size(0) 

correct += (predicted == y test) .sum().item() 

detached pred = predicted.detach().cpu().numpy() 

detached label = y test.detach() .cpu() .numpy() 

for f in range(0, len(detached_label)): 
preds.append (detached pred[f)) 
y_true.append (detached label[f]) 


print('Test Accuracy of the model on the 10000 test images: {:.2%}'.format(correct / total)) 
preds = np.eye(num_classes) [preds] 
y_true = np.eye(num_classes) [y true] 


auc = roc auc _score(np.round(preds), y true) 
print("AUC: {:.2%}".format (auc) ) 


Figure B-50. The entire test function, comprised of code from Figures B-48 and B-49 


405 


APPENDIX B INTRO TO PYTORCH 


Finally, you can train our model as shown in Figure B-51. 


model = TCN().to (device) 
criterion = nn.CrossEntropyLoss () 


optimizer = torch.optim.Adam(model.parameters(), 
lr=learning rate) 


## Training phase 


for epoch in range(0, num epochs): 


tlainimModel, Cevice, xX Crain, yy. train, Cricerion, 
optimizer, epoch) 





Figure B-51. Initializing the TCN model, defining the criterion as the cross 
entropy loss, and defining the optimizer (Adam optimizer) 


406 


The output should look somewhat like Figure B-52. 


In [15]: 


model = TCN().to(device) 


criterion 
4 optimizer 


for epoch in range(0, num_epochs): 


Epoch 
Epoch 
Epoch 
Epoch 
Epoch 
Epoch 
Epoch 
Epoch 
Epoch 
Epoch 
Epoch 
Epoch 
Epoch 
Epoch 
Epoch 
Epoch 
Epoch 
Epoch 
Epoch 
Epoch 
Epoch 
Epoch 
Epoch 
Epoch 
Epoch 
Epoch 
Epoch 
Epoch 
Epoch 
Epoch 


#? Training phase 


nn.CrossEntropyLoss () 
torch.optim.Adam(model.parameters(), lr=learning_rate) 


APPENDIX B 


INTRO TO PYTORCH 


train(model, device, x_train, y_ train, criterion, optimizer, epoch) 


1/30, Loss: 

2/30, Loss: 

3/20, Logs: 

4/30, Loss: 

5/30, Loss: 

6/20, Loge: 

7/30, Loss: 

8/30, Loss: 

9/30, Logs: 

10/30, Loss: 
11/30, Loss: 
12/30, Loss: 
13/30, Loss: 
14/30, Loss: 
15/30, Loss: 
16/30, Loss: 
17/30, Loss: 
18/30, Loss: 
19/30, Loss: 
20/30, Loss: 
21/30, Loss: 
22/30, Loss: 
23/30, Loss: 
24/30, Loss: 
25/30, Loss: 
26/30, Loss: 
27/30, Loss: 
28/30, Loss: 
29/30, Loss: 
30/30, Loss: 


-7319 

-5128 

3454 

- 2430 

1245 

-1041 

-0684 

0737 

.0646 

0.0672 
0.0650 
0.0545 
0.0526 
0.0428 
0.0403 
0.0409 
0.0453 
0.0366 
0.0357 
0.0358 
0.0351 
0.0347 
0.0342 
0.0333 
0.0329 
0.0322 
0.0316 
0.0314 
0.0307 
0.0304 


Figure B-52. The output of the training process 


And now you can evaluate your model (see Figure B-53). 


## Testing phase 


test (model, 


yUest) 


device, 


Figure B-53. Calling the test function 


x test, 





407 


APPENDIX B INTRO TO PYTORCH 


The output should look somewhat like Figure B-54. 


ae. Area a . 
In [16]: e% Testing phase 


test (model, device, x test, y test) 


2 


Test Accuracy of the model on the 10000 test images: 99.14% 
AUC: 98.98% 


Figure B-54. The output AUC value of the testing function 


With the end of this example, you will have created a TCN in both Keras and PyTorch. 
This way, you'll have a good way to compare how the model is built, trained, and 
evaluated in both frameworks, allowing you to observe the similarities and differences in 
how both frameworks handle those processes. 

By now, you should have a better understanding of how PyTorch works, especially 
with how it’s meant to be more intuitive. Think back to the training function and the 
process of converting the data sets, passing them through the GPU and through the 
model, calculated the gradients, and backpropagating. Though it’s not abstracted away 
from you like in Keras, it still makes logical sense in that the functions called directly 


correlate to the training process of a neural network. 


Summary 


PyTorch is a low-level tool that allows you to quickly create, train, and test your 

own deep learning models, although it is more complicated than doing the same in 
Keras. However, it offers you much more functionality, flexibility, and customizability 
compared to Keras, and compared to TensorFlow, it is much lighter on syntax. With 
PyTorch, you don’t have to worry about switching frameworks as you get more advanced 
because of the functionality that it offers, making it a great tool to use when conduct 
deep learning research. PyTorch should be enough for most of your needs as you 
become more experienced with deep learning, and using either PyTorch or TensorFlow 
(or tf.keras + TensorFlow) is just a matter of personal preference. 


408 


Index 


A 


Adam optimizer, 103, 346, 391 
Anomaly detection 
abnormal behavior, 299 
data points 
location, 6 
range of density/tensile 
strength, 5, 6 
sample screw falls, 7, 8 
defined, 298 
example, 298, 299 
taxi cabs, number of pickups, 11-15 
time series, 9, 11 
uses 
data breaches, 20 
identity theft, 21 
manufacturing, 21, 22 
medicine, 22, 23 
networking, 22 
Arctanh function, 361 
Area under the curve of the 
receiver operating 
characteristic (AUROC), 29 
Autoencoders, 302 
activation functions, 131 
anomalies, 140 
CNN 
compile model, 147, 148 
neural network, 145, 146 


© Sridhar Alla, Suman Kalyan Adari 2019 


display encoded images, 148, 149 
import packages, 144, 145 
load MNIST data, 145 
training process, 150-152 
compile model, 132 
confusion matrix, 139 
deep neural network, 142, 143 
importing packages, 127, 128 
latent/compressed representation, 125 
measure anomalies, 137 
neural network, 123, 124 
Pandas dataframe, 129, 130 
precision/recall code, 138 
reconstruction loss, 126 
splitting data, 131 
training process, 132, 134-136 
visualize results via confusion 
matrix, 128, 129 


Banking sector 


autoencoders, 302 
credit card, 302 


Bi-directional encoders, 257 
Boltzmann machine 


bidirectional neural network, 179 
derivations, 180 

generative network, 180 

graph, 180 


409 


S. Alla and S. K. Adari, Beginning Anomaly Detection Using Python-Based Deep Learning, 


https://doi.org/10.1007/978-1-4842-5177-5 


INDEX 


C 


categorical() function, 275 
Confusion matrix, 26, 27, 139 
Context-based anomalies, 16 
Contrastive divergence (CD), 186 
Convolutional neural 
networks (CNN), 85, 144, 304 

Credit card data set 

AUC scores, 193 

free energy vs. probabilities, 195, 196 

modules, import, 187 

normal data points, 193, 194 

output training model, 192 

parameters, 191, 192 

RBM model, 190 

standardized values, 189 

training process, 188 

training/testing sets, 190 
Cybersecurity 

DOS attack, 310 

intrusion activity, 311 

TCP connections, 311 

Trojan, 310 


D 


Data point-based anomalies, 16 
Data science 
accuracy, 28 
AUC score, 32, 33 
AUROCG, 29 
confusion matrix, 26 
definitions, 26 
F1 score, 29 
precision, 28 
recall, 28 
ROC curve, 29 
training data set, 30 


410 


type I error, 26 
type Il error, 26 
Deep belief networks (DBN), 180 
Deep Boltzmann 
machines (DBM), 180 
Deep learning, 309 
artificial neural networks 
activation function, 76 
backpropagation, 81, 83 
cost function, 82 
gradient, 82 
hidden layer, 77, 80 
input layer, 78, 79 
Keras framework, 84 
layers, 76 
learning rate, 83 
mean squared error, 82 
neuron, 74-76 
output layer, 81 
PyTorch, 84 
tensorflow, 84 
GPU, 73 
models, 73 
Deep learning-based anomaly 
detection 
challenges, 317 
key steps, 316, 317 
Denial of service (DOS) attack, 310 
Denoising autoencoder 
compiling model, 157 
depiction, 153 
display encoded images, 159 
evaluate model, 158 
import packages, 154 
load MNIST images, 154 
load/reshape images code, 155 
neural network, 155, 156 
training process, 158, 160-162 


Dilated TCN, anomaly detection 
AUC score, 281, 282 
classification report, 282 
confusion matrix, 282 
data frame, 270, 271, 273 
import packages, 267, 268 
model, defined, 276, 277 
model summary, 278, 279 
shape data sets, 274-276 
sort by Time column, 274 
standardize values, 271, 272 
training process, 280 
visualization class, 268, 269 

Dilated temporal convolutional 

network (TCN) 
acausal network, 266 
anomaly detection (see Dilated TCN, 
anomaly detection) 
causal network, 266, 267 
dilation factor, 262, 263 
feature map, 264 
filter weights, 264 
input vector, 264 
one-dimensional 
convolutions, 264 
output vector, 265 
dilation factor, 262, 263 


E 


ED-TCN, anomaly detection 
AUC score, 294, 295 
decoding stage, 290 
encoding stage, 289 
evaluate performance, 294 
import modules, 286, 287 
model summary, 292 
reshape data sets, 288 


INDEX 


train, data, 293 
zero padding, 292, 293 
Encoder-decoder TCN 
anomaly detection (see ED-TCN, 
anomaly detection) 
decoding stage, 285 
encoding stage, 284 
model structure, 283, 284 
upsampling, 285, 286 
Environmental use case 
air quality index, 303, 304 
deforestation, 303 
Epoch, 86 


F 


Filter/kernel operation, 378, 379 
Finance and insurance industries, 308, 309 


G 


Gradient-based optimization 
technique, 347, 391 


H 


Healthcare, 304-306 


J 
inception-v3 model, 259 
Isolation forest 
mutant fish, 34, 35 
works 
calculate AUC, 49 
categorical values, 44 
data sets, 39, 40 
filtering df, 41 


4ll 


INDEX 


Isolation forest (cont. ) 


histogram, 50, 51 

KDDCUP 1999 data set, 36, 37 
label encoder, 42-44 
Matplotlib, 38 

numpy module, 37, 38 
Pandas module, 38 
parameters, 47 

scikit-learn, 38 

training set, 45 

validation set, 46 


KDDCUP data set 


anomalies vs. normal data points, 211 
anomalous data, 201, 210 
AUC scores, 203, 209 

define column, 198 

exploding gradient, 205 

free energy vs. probability, 211 
HTTP attacks, 199 

Jupyter cell, 204 

label encoder, 199, 200, 204 
modules, import, 197 

output, 201-203, 206 

training output, 207, 208 
training/testing sets, 205 
unsupervised training, 206 


Keras, 84 


activation function, 331 

activation map/feature map, 95 

adam optimizer, 346, 347 

AUC score, 107, 109 

back end (TensorFlow 
operations), 358, 359 

binary accuracy, 343, 344 

categorical accuracy, 344, 345 


412 


CNN, 85 

compiling model, 94 

data set, 87 

deep learning model, 319 

dense layer, 102, 329, 330 

dropout layer, 101, 331, 332 

epoch, 86 

evaluate function, 327 

file path, 328 

filter, 96-99 

flatten layer, 332, 333 

functional model, 321 

image properties, 89, 90 

input layer, 328, 329 

matplotlib, 86 

Max pooling, 100, 339-340 

min-max normalization, 90-92 

MNIST dataset, 85 

ModelCheckpoint, 351, 352 

model compilation/training 
ModelCheckpoint(), 323 
model.fit() function, 324 
parameters, 322, 323 
verbosity, 324, 325 

model evaluation/prediction, 326 

normalization/feature scaling, 90 

one-dimensional convolutional 

layer, 334, 335 

parameters, 326 

pooling layer, 101 

prediction function, 327 

ReLU function, 102, 103 

RLU, 349, 350 

RMSprop, 347, 348 

sequential model, 95, 321 

sigmoid activation, 350, 351 

softmax activation, 348 

Spatial Dropout, 333, 334 


standardization, 91 
TensorBoard (see TensorBoard) 
TensorFlow/PyTorch, 319, 320 
training data, 105 
transformed data, 93 
2D convolutional layer, 336, 337 
Unit length scaling, 91 
vector representation, 92, 93 
ZeroPadding, 338, 339 

Kernel trick, 61 


L 


Label encoder, 42-44 
Long Short-Term Memory (LSTM) models 
activation function, 219, 220 
anomalies, 242 
anomaly detection 
adam optimizer, 230 
dataframe, 230 
dataset, 226 
errors, 224 
import packages, 223, 224 
plotting time series, 227 
value column, 228, 229 
visualize errors, 225 
arguments, 231, 232 
compute threshold, 240, 241 
dataframe, 242 
definition, 218 
linear/non-linear data plots, 220 
RNN, 216, 217 
sequence/time series, 213-215 
sigmoid activation function, 221, 222 
tanh function, 219 
testing dataset, 239 
time series, examples 
ambient_temperature_system_ 
failure, 251-254 


INDEX 


art_daily_jumpsdown, 246-248 
art_daily_nojump, 244-246 
art_daily_no_noise, 243, 244 
art_daily_perfect_square_ 
wave, 248-250 
art_load_balancer_spikes, 250, 251 
rds_cpu_utilization, 254, 255 
training process, 235-238 
Loss functions 
cross entropy loss, 388, 389 
Keras 
categorical cross entropy, 341 
mean squared error, 340, 341 
sparse categorical cross 
entropy, 342, 343 
MSE, 387, 388 


Manufacturing sector 
automation, 313 
sensors, 313, 314 
Matplotlib, 38 
Mean normalization, 91 
Mean squared error, 82, 230 
Mean squared loss (MSE), 387 
Modified National Institute of 
Standards and Technology 
(MNIST), 85, 392 
Momentum, 187, 346 


N 


Nesterov momentum, 346, 390 
Noise removal, 18 
Normalization/feature 
scaling, 90 
novelties.head(), 202 
Novelty detection, 18, 19, 51 
A413 


INDEX 


O 


One class SVM 
data points, 58, 59 
gamma, 61 
hyperplane, 54-57 
kernel, 61 
novelties, 62 
regularization, 61 
semi-supervised anomaly detection, 51 
support vector, 54 
visualize data, 52, 53 
works 
accuracy, 67-69 
AUC score, 69, 70 
categorical values, 64 
data points, 71 
data sets shapes, 65, 66 
filtering data, 64 
importing modules, 63 
KDDCUP 1999 data set, 63 
label encoder, 65 
model, 66 
Optimizers 
adam, 391 
RMSprop, 391, 392 
SGD, 390 
Outlier detection, 18 


P,Q 


Pattern-based anomalies, 17 
Persistent contrastive divergence (PCD), 186 
Probability function, 183, 184 
PyTorch, 84 
AUC score, 119, 121 
compatibility, 362 
creating CNN, 115-117 
creating model, 114 


414 


deep learning library, 361 
hyperparameters, 112, 371, 372 
Jupyter cell, 365, 367, 369, 371, 374 
layers 
Convld, 377, 378 
Conv2d, 378, 379 
dropout, 382 
linear, 380 
log _softmax, 385, 386 
MaxPooling1D, 380 
MaxPooling2D, 381 
ReLU, 383, 384 
sigmoid function, 386, 387 
softmax, 384, 385 
ZeroPadding2D, 381 
loss functions, 387 
low-level language, 361 
model, 366 
network creation, 365 
optimizer, 373 
sequential vs. modulelist, 376, 377 
temporal convolutional network, 393 
TensorFlow, 361 
tensor operations, 362-364 
testing, 369-370 
training 
algorithm, 368 
data, 118 
function, 369 
process, 119, 375 
training/testing data sets, 113 


R 


Receiver operating characteristic (ROC) 
curve, 29 

Rectified Linear Unit (ReLU), 102, 131, 
349, 350 


Recurrent neural network (RNN), 216, 257 
Restricted boltzmann 
machine (RBM), 180 
credit card data set (see Credit card 
data set) 
energy function, 182 
expected value, 186 
KDDCUP data set (see KDDCUP data 
set) 
layers, 181 
probability function, 183, 184 
sigmoid function, 185 
unsupervised learning algorithm, 186 
vector vs. transposed vector, 183 
visual representation, 181, 182 
Retail industry, 315, 316 


S 


Scikit-learn, 38, 42 
Semi-supervised anomaly 
detection, 19, 51 
Smart home system, 315 
Social media platforms, 307, 308 
Softmax, 131 
Sparse autoencoders, 140-142 
Standardization (z-score 
normalization), 91 
Stochastic gradient descent (SGD), 345, 
346, 390 
Supervised anomaly detection, 
19, 262, 283 
Support vector machine (SVM), 53, 61 


7 


tanh function, 219, 222 
Telecom sector 


INDEX 


roaming, 300 
service disruption, 300, 301 
TCN or LSTM algorithms, 301 
Temporal convolutional 
networks (TCNs) 
advantages, 258 
anomaly/normal data, 395 
data set, 394 
defined, 257 
disadvantages, 258, 259 
import modules, 393 
Jupyter cell, 393, 401 
one-dimensional operation 
dilation, 262 
input vector, 259, 260 
output vector, 260, 261 
standard values, 394, 395 
TCN class, 399, 400, 406 
testing function, 404, 405, 407, 408 
training function, 402 
training/testing sets, 396-398 
TensorBoard 
command prompt, 354 
graph, 357, 358 
MNIST data set, 353 
parameters, 352, 353 
val_acc/val_loss, 356 
TensorFlow, 84, 113, 121, 320 
train_test_split function, 46, 274 
Transfer learning, 259 
Transportation sector, 306, 307 


U 


Unit length scaling, 91 

Unsupervised anomaly 
detection, 19, 34 

Upsampling, 285, 337 


415 


INDEX 


V, W, X, Y, Z import packages, 165, 166 
neural network, 164, 170, 171 


Pandas dataframe, 168 


Variational autoencoder 
anomalies, 175 


confusion matrix, 173, 174 results via confusion matrix, 167 
definition, 163 training process, 172, 173, 176, 177 
distribution code, 169 Video surveillance, 312, 313 


416 


